PROJECT INTRODUCTION

In this project, using dataset public in HomeCredit competition for predictive default loan in credit risk. To understand better the background of this project, let's walk through this company profile and why they need to predict probability of default.

Home Credit is a global consumer finance provider focusing on responsible lending to underbanked populations, leveraging technology for access to financial services. In other to reduce credit loss where customers are not able to pay back the loan, one of the methods is assessing customer's probability of default before granting certain loan amount or credit limit, via authorized data on the applicant. In this scope of project, will focus on Cash Loan product, which is to predict probability of default on applicant applying for a certain loan amount.

For this problem of classification on whether or not customer will be default, supervised learning algorithm will be used with target to build main model using Logistic Regression. As requested in this scope of competition, AUC metric will be used to evaluate model's performance, with aim to achieve minimum 0.68. Additionally, AIC metric will also be used to measure model's goodness and complexity, in order to select most efficient model.

For feature selection criteria, IV and WoE will be used to measure feature's importance, as well as p-value for feature significant.

Source of data reference:
HOME CREDIT GROUP · FEATURED PREDICTION COMPETITION (2018) · https://www.kaggle.com/competitions/home-credit-default-risk/data

</font>

In [3]:
#import necessary tools
import pandas as pd
import numpy as np
pd.options.display.float_format = '{:.2f}'.format
pd.options.mode.chained_assignment = None
from sklearn.impute import SimpleImputer
import seaborn as sns
import matplotlib.pyplot as plt
In [4]:
#import data
path_desc = '/11. Machine Learning/11.1 Intro to ML Supervised Learning/final project tips/dataset for project/HC data/HomeCredit_columns_description.csv'
path_apptrain = '/11. Machine Learning/11.1 Intro to ML Supervised Learning/final project tips/dataset for project/HC data/application_train.csv'
path_apptest = '/11. Machine Learning/11.1 Intro to ML Supervised Learning/final project tips/dataset for project/HC data/application_test.csv'

data_desc = pd.read_csv(path_desc, encoding='cp1252')
data_apptrain = pd.read_csv(path_apptrain, encoding='cp1252')
data_apptest = pd.read_csv(path_apptest, encoding='cp1252')

1. DATA EXPLORATION

In [5]:
data_apptrain.shape
Out[5]:
(307511, 122)

As the original dataset contain 307511 rows and 122 columns, including both Cash Loans and Revolving Loans. In this scope of project, filter to use dataset of only cash loan product:

In [6]:
df = data_apptrain[data_apptrain['NAME_CONTRACT_TYPE']== 'Cash loans']
df.shape
Out[6]:
(278232, 122)

Remaining size of Cash Loans dataset is 278232 rows x 122 collumns will be used for data exploratory and model training and assessment onwards. For cross check and identify proper method during data cleaning process, data_apptest on Cash Loans with 48305 rows x 121 cols will also be used to double check of logic, but will not participate to model training.

1.1 Inspection on Cash Loan dataset¶

In this section, we will inspect data and focus on below step:

1. handling missing data;
2. handling incorrect datatype;
3. impute data inconsistency.
In [7]:
#check datatype of data_trainCL
set_dtype = set()
for i in range(len(df.columns)):
    set_dtype.add(df.iloc[:,i].dtypes)
    
print(set_dtype)
{dtype('O'), dtype('int64'), dtype('float64')}
In [8]:
#inspecting missing values and number of values
def missing_inspection(df):
    numeric_missing = {}
    cat_missing = {}
    
    for i in range(len(df.columns)):
        if df.iloc[:,i].dtypes == 'int64' or df.iloc[:,i].dtypes == 'float64':
            print (f'missing value in column {df.columns[i]} is {df.iloc[:,i].isnull().sum()/len(df)} \n')
            if df.iloc[:,i].isnull().sum() > 0:
                numeric_missing[df.columns[i]]=df.iloc[:,i].isnull().sum()
        else:
            print(f'number of missing values in this column is {df.iloc[:,i].isnull().sum()/len(df)} \n{df.iloc[:,i].value_counts(dropna = False)} \n')
            if df.iloc[:,i].isnull().sum() > 0:
                cat_missing[df.columns[i]]=df.iloc[:,i].isnull().sum()
    return print(f'Summary of missing data: {len(numeric_missing.keys())+len(cat_missing.keys())} cols has missing value, in which Numeric is {len(numeric_missing.keys())} cols and Category is {len(cat_missing.keys())} cols \n \n NUMERIC: \n {numeric_missing}, \n \n CATEGORY: \n {cat_missing}')
In [9]:
#checking characters of missing value in data_apptest
missing_inspection(data_apptest[data_apptest['NAME_CONTRACT_TYPE']=='Cash loans'])
missing value in column SK_ID_CURR is 0.0 

number of missing values in this column is 0.0 
Cash loans    48305
Name: NAME_CONTRACT_TYPE, dtype: int64 

number of missing values in this column is 0.0 
F    32409
M    15896
Name: CODE_GENDER, dtype: int64 

number of missing values in this column is 0.0 
N    32033
Y    16272
Name: FLAG_OWN_CAR, dtype: int64 

number of missing values in this column is 0.0 
Y    33295
N    15010
Name: FLAG_OWN_REALTY, dtype: int64 

missing value in column CNT_CHILDREN is 0.0 

missing value in column AMT_INCOME_TOTAL is 0.0 

missing value in column AMT_CREDIT is 0.0 

missing value in column AMT_ANNUITY is 0.0004968429769175033 

missing value in column AMT_GOODS_PRICE is 0.0 

number of missing values in this column is 0.018859331332160233 
Unaccompanied      39357
Family              5834
Spouse, partner     1437
NaN                  911
Children             405
Other_B              206
Other_A              108
Group of people       47
Name: NAME_TYPE_SUITE, dtype: int64 

number of missing values in this column is 0.0 
Working                 24371
Commercial associate    11246
Pensioner                9185
State servant            3501
Student                     2
Name: NAME_INCOME_TYPE, dtype: int64 

number of missing values in this column is 0.0 
Secondary / secondary special    33747
Higher education                 12330
Incomplete higher                 1713
Lower secondary                    474
Academic degree                     41
Name: NAME_EDUCATION_TYPE, dtype: int64 

number of missing values in this column is 0.0 
Married                 32002
Single / not married     6954
Civil marriage           4220
Separated                2938
Widow                    2191
Name: NAME_FAMILY_STATUS, dtype: int64 

number of missing values in this column is 0.0 
House / apartment      43238
With parents            2224
Municipal apartment     1602
Rented apartment         714
Office apartment         404
Co-op apartment          123
Name: NAME_HOUSING_TYPE, dtype: int64 

missing value in column REGION_POPULATION_RELATIVE is 0.0 

missing value in column DAYS_BIRTH is 0.0 

missing value in column DAYS_EMPLOYED is 0.0 

missing value in column DAYS_REGISTRATION is 0.0 

missing value in column DAYS_ID_PUBLISH is 0.0 

missing value in column OWN_CAR_AGE is 0.6631611634406376 

missing value in column FLAG_MOBIL is 0.0 

missing value in column FLAG_EMP_PHONE is 0.0 

missing value in column FLAG_WORK_PHONE is 0.0 

missing value in column FLAG_CONT_MOBILE is 0.0 

missing value in column FLAG_PHONE is 0.0 

missing value in column FLAG_EMAIL is 0.0 

number of missing values in this column is 0.3197805610185281 
NaN                      15447
Laborers                  8597
Sales staff               5040
Core staff                4313
Managers                  3519
Drivers                   2749
High skill tech staff     1833
Accountants               1614
Medicine staff            1312
Security staff             909
Cooking staff              888
Cleaning staff             655
Private service staff      452
Low-skill Laborers         269
Secretaries                212
Waiters/barmen staff       177
Realty agents              136
HR staff                   104
IT staff                    79
Name: OCCUPATION_TYPE, dtype: int64 

missing value in column CNT_FAM_MEMBERS is 0.0 

missing value in column REGION_RATING_CLIENT is 0.0 

missing value in column REGION_RATING_CLIENT_W_CITY is 0.0 

number of missing values in this column is 0.0 
TUESDAY      9661
WEDNESDAY    8371
MONDAY       8338
THURSDAY     8315
FRIDAY       7197
SATURDAY     4579
SUNDAY       1844
Name: WEEKDAY_APPR_PROCESS_START, dtype: int64 

missing value in column HOUR_APPR_PROCESS_START is 0.0 

missing value in column REG_REGION_NOT_LIVE_REGION is 0.0 

missing value in column REG_REGION_NOT_WORK_REGION is 0.0 

missing value in column LIVE_REGION_NOT_WORK_REGION is 0.0 

missing value in column REG_CITY_NOT_LIVE_CITY is 0.0 

missing value in column REG_CITY_NOT_WORK_CITY is 0.0 

missing value in column LIVE_CITY_NOT_WORK_CITY is 0.0 

number of missing values in this column is 0.0 
Business Entity Type 3    10744
XNA                        9185
Self-employed              5874
Other                      2681
Medicine                   1709
Government                 1493
Business Entity Type 2     1466
Trade: type 7              1293
School                     1278
Kindergarten               1035
Construction               1033
Business Entity Type 1      877
Transport: type 4           876
Trade: type 3               566
Military                    525
Industry: type 9            491
Industry: type 3            485
Security                    467
Transport: type 2           447
Police                      438
Housing                     432
Industry: type 11           415
Bank                        369
Security Ministries         338
Services                    298
Postal                      293
Agriculture                 292
Restaurant                  281
Trade: type 2               230
University                  218
Industry: type 7            215
Industry: type 1            176
Transport: type 3           174
Industry: type 4            166
Electricity                 155
Hotel                       132
Trade: type 6               118
Industry: type 5             96
Telecom                      94
Emergency                    90
Insurance                    77
Industry: type 12            76
Industry: type 2             76
Realtor                      72
Advertising                  70
Trade: type 1                63
Culture                      60
Legal Services               52
Mobile                       45
Cleaning                     41
Transport: type 1            34
Industry: type 6             27
Industry: type 10            24
Trade: type 4                14
Religion                     12
Trade: type 5                 9
Industry: type 13             6
Industry: type 8              2
Name: ORGANIZATION_TYPE, dtype: int64 

missing value in column EXT_SOURCE_1 is 0.4209916157747645 

missing value in column EXT_SOURCE_2 is 0.00014491253493427182 

missing value in column EXT_SOURCE_3 is 0.17780768036435152 

missing value in column APARTMENTS_AVG is 0.49065314149673944 

missing value in column BASEMENTAREA_AVG is 0.5675188903840183 

missing value in column YEARS_BEGINEXPLUATATION_AVG is 0.46966152572197495 

missing value in column YEARS_BUILD_AVG is 0.6531621985301729 

missing value in column COMMONAREA_AVG is 0.6874650657281854 

missing value in column ELEVATORS_AVG is 0.5174619604595797 

missing value in column ENTRANCES_AVG is 0.4844219024945658 

missing value in column FLOORSMAX_AVG is 0.4791429458648173 

missing value in column FLOORSMIN_AVG is 0.6665355553255357 

missing value in column LANDAREA_AVG is 0.5801676845047097 

missing value in column LIVINGAPARTMENTS_AVG is 0.672849601490529 

missing value in column LIVINGAREA_AVG is 0.4839250595176483 

missing value in column NONLIVINGAPARTMENTS_AVG is 0.6845047096573854 

missing value in column NONLIVINGAREA_AVG is 0.5357623434427078 

missing value in column APARTMENTS_MODE is 0.49065314149673944 

missing value in column BASEMENTAREA_MODE is 0.5675188903840183 

missing value in column YEARS_BEGINEXPLUATATION_MODE is 0.46966152572197495 

missing value in column YEARS_BUILD_MODE is 0.6531621985301729 

missing value in column COMMONAREA_MODE is 0.6874650657281854 

missing value in column ELEVATORS_MODE is 0.5174619604595797 

missing value in column ENTRANCES_MODE is 0.4844219024945658 

missing value in column FLOORSMAX_MODE is 0.4791429458648173 

missing value in column FLOORSMIN_MODE is 0.6665355553255357 

missing value in column LANDAREA_MODE is 0.5801676845047097 

missing value in column LIVINGAPARTMENTS_MODE is 0.672849601490529 

missing value in column LIVINGAREA_MODE is 0.4839250595176483 

missing value in column NONLIVINGAPARTMENTS_MODE is 0.6845047096573854 

missing value in column NONLIVINGAREA_MODE is 0.5357623434427078 

missing value in column APARTMENTS_MEDI is 0.49065314149673944 

missing value in column BASEMENTAREA_MEDI is 0.5675188903840183 

missing value in column YEARS_BEGINEXPLUATATION_MEDI is 0.46966152572197495 

missing value in column YEARS_BUILD_MEDI is 0.6531621985301729 

missing value in column COMMONAREA_MEDI is 0.6874650657281854 

missing value in column ELEVATORS_MEDI is 0.5174619604595797 

missing value in column ENTRANCES_MEDI is 0.4844219024945658 

missing value in column FLOORSMAX_MEDI is 0.4791429458648173 

missing value in column FLOORSMIN_MEDI is 0.6665355553255357 

missing value in column LANDAREA_MEDI is 0.5801676845047097 

missing value in column LIVINGAPARTMENTS_MEDI is 0.672849601490529 

missing value in column LIVINGAREA_MEDI is 0.4839250595176483 

missing value in column NONLIVINGAPARTMENTS_MEDI is 0.6845047096573854 

missing value in column NONLIVINGAREA_MEDI is 0.5357623434427078 

number of missing values in this column is 0.6730980229789877 
NaN                      32514
reg oper account         12001
reg oper spec account     1971
org spec account           911
not specified              908
Name: FONDKAPREMONT_MODE, dtype: int64 

number of missing values in this column is 0.4852706759134665 
block of flats      24404
NaN                 23441
specific housing      256
terraced house        204
Name: HOUSETYPE_MODE, dtype: int64 

missing value in column TOTALAREA_MODE is 0.46485871027843906 

number of missing values in this column is 0.4908394576130835 
NaN             23710
Panel           11147
Stone, brick    10334
Block            1418
Wooden            783
Mixed             351
Monolithic        283
Others            279
Name: WALLSMATERIAL_MODE, dtype: int64 

number of missing values in this column is 0.4563709760894317 
No     25906
NaN    22045
Yes      354
Name: EMERGENCYSTATE_MODE, dtype: int64 

missing value in column OBS_30_CNT_SOCIAL_CIRCLE is 2.0701790704895972e-05 

missing value in column DEF_30_CNT_SOCIAL_CIRCLE is 2.0701790704895972e-05 

missing value in column OBS_60_CNT_SOCIAL_CIRCLE is 2.0701790704895972e-05 

missing value in column DEF_60_CNT_SOCIAL_CIRCLE is 2.0701790704895972e-05 

missing value in column DAYS_LAST_PHONE_CHANGE is 0.0 

missing value in column FLAG_DOCUMENT_2 is 0.0 

missing value in column FLAG_DOCUMENT_3 is 0.0 

missing value in column FLAG_DOCUMENT_4 is 0.0 

missing value in column FLAG_DOCUMENT_5 is 0.0 

missing value in column FLAG_DOCUMENT_6 is 0.0 

missing value in column FLAG_DOCUMENT_7 is 0.0 

missing value in column FLAG_DOCUMENT_8 is 0.0 

missing value in column FLAG_DOCUMENT_9 is 0.0 

missing value in column FLAG_DOCUMENT_10 is 0.0 

missing value in column FLAG_DOCUMENT_11 is 0.0 

missing value in column FLAG_DOCUMENT_12 is 0.0 

missing value in column FLAG_DOCUMENT_13 is 0.0 

missing value in column FLAG_DOCUMENT_14 is 0.0 

missing value in column FLAG_DOCUMENT_15 is 0.0 

missing value in column FLAG_DOCUMENT_16 is 0.0 

missing value in column FLAG_DOCUMENT_17 is 0.0 

missing value in column FLAG_DOCUMENT_18 is 0.0 

missing value in column FLAG_DOCUMENT_19 is 0.0 

missing value in column FLAG_DOCUMENT_20 is 0.0 

missing value in column FLAG_DOCUMENT_21 is 0.0 

missing value in column AMT_REQ_CREDIT_BUREAU_HOUR is 0.12404512990373667 

missing value in column AMT_REQ_CREDIT_BUREAU_DAY is 0.12404512990373667 

missing value in column AMT_REQ_CREDIT_BUREAU_WEEK is 0.12404512990373667 

missing value in column AMT_REQ_CREDIT_BUREAU_MON is 0.12404512990373667 

missing value in column AMT_REQ_CREDIT_BUREAU_QRT is 0.12404512990373667 

missing value in column AMT_REQ_CREDIT_BUREAU_YEAR is 0.12404512990373667 

Summary of missing data: 64 cols has missing value, in which Numeric is 58 cols and Category is 6 cols 
 
 NUMERIC: 
 {'AMT_ANNUITY': 24, 'OWN_CAR_AGE': 32034, 'EXT_SOURCE_1': 20336, 'EXT_SOURCE_2': 7, 'EXT_SOURCE_3': 8589, 'APARTMENTS_AVG': 23701, 'BASEMENTAREA_AVG': 27414, 'YEARS_BEGINEXPLUATATION_AVG': 22687, 'YEARS_BUILD_AVG': 31551, 'COMMONAREA_AVG': 33208, 'ELEVATORS_AVG': 24996, 'ENTRANCES_AVG': 23400, 'FLOORSMAX_AVG': 23145, 'FLOORSMIN_AVG': 32197, 'LANDAREA_AVG': 28025, 'LIVINGAPARTMENTS_AVG': 32502, 'LIVINGAREA_AVG': 23376, 'NONLIVINGAPARTMENTS_AVG': 33065, 'NONLIVINGAREA_AVG': 25880, 'APARTMENTS_MODE': 23701, 'BASEMENTAREA_MODE': 27414, 'YEARS_BEGINEXPLUATATION_MODE': 22687, 'YEARS_BUILD_MODE': 31551, 'COMMONAREA_MODE': 33208, 'ELEVATORS_MODE': 24996, 'ENTRANCES_MODE': 23400, 'FLOORSMAX_MODE': 23145, 'FLOORSMIN_MODE': 32197, 'LANDAREA_MODE': 28025, 'LIVINGAPARTMENTS_MODE': 32502, 'LIVINGAREA_MODE': 23376, 'NONLIVINGAPARTMENTS_MODE': 33065, 'NONLIVINGAREA_MODE': 25880, 'APARTMENTS_MEDI': 23701, 'BASEMENTAREA_MEDI': 27414, 'YEARS_BEGINEXPLUATATION_MEDI': 22687, 'YEARS_BUILD_MEDI': 31551, 'COMMONAREA_MEDI': 33208, 'ELEVATORS_MEDI': 24996, 'ENTRANCES_MEDI': 23400, 'FLOORSMAX_MEDI': 23145, 'FLOORSMIN_MEDI': 32197, 'LANDAREA_MEDI': 28025, 'LIVINGAPARTMENTS_MEDI': 32502, 'LIVINGAREA_MEDI': 23376, 'NONLIVINGAPARTMENTS_MEDI': 33065, 'NONLIVINGAREA_MEDI': 25880, 'TOTALAREA_MODE': 22455, 'OBS_30_CNT_SOCIAL_CIRCLE': 1, 'DEF_30_CNT_SOCIAL_CIRCLE': 1, 'OBS_60_CNT_SOCIAL_CIRCLE': 1, 'DEF_60_CNT_SOCIAL_CIRCLE': 1, 'AMT_REQ_CREDIT_BUREAU_HOUR': 5992, 'AMT_REQ_CREDIT_BUREAU_DAY': 5992, 'AMT_REQ_CREDIT_BUREAU_WEEK': 5992, 'AMT_REQ_CREDIT_BUREAU_MON': 5992, 'AMT_REQ_CREDIT_BUREAU_QRT': 5992, 'AMT_REQ_CREDIT_BUREAU_YEAR': 5992}, 
 
 CATEGORY: 
 {'NAME_TYPE_SUITE': 911, 'OCCUPATION_TYPE': 15447, 'FONDKAPREMONT_MODE': 32514, 'HOUSETYPE_MODE': 23441, 'WALLSMATERIAL_MODE': 23710, 'EMERGENCYSTATE_MODE': 22045}
In [10]:
#comparing to characters of missing value in training set
missing_inspection(df)
missing value in column SK_ID_CURR is 0.0 

missing value in column TARGET is 0.0 

number of missing values in this column is 0.0 
Cash loans    278232
Name: NAME_CONTRACT_TYPE, dtype: int64 

number of missing values in this column is 0.0 
F    182800
M     95432
Name: CODE_GENDER, dtype: int64 

number of missing values in this column is 0.0 
N    183775
Y     94457
Name: FLAG_OWN_CAR, dtype: int64 

number of missing values in this column is 0.0 
Y    190207
N     88025
Name: FLAG_OWN_REALTY, dtype: int64 

missing value in column CNT_CHILDREN is 0.0 

missing value in column AMT_INCOME_TOTAL is 0.0 

missing value in column AMT_CREDIT is 0.0 

missing value in column AMT_ANNUITY is 4.312947468299836e-05 

missing value in column AMT_GOODS_PRICE is 0.0 

number of missing values in this column is 0.0036192817504816124 
Unaccompanied      224541
Family              36687
Spouse, partner     10431
Children             2965
Other_B              1592
NaN                  1007
Other_A               766
Group of people       243
Name: NAME_TYPE_SUITE, dtype: int64 

number of missing values in this column is 0.0 
Working                 142719
Commercial associate     63652
Pensioner                51993
State servant            19836
Student                     15
Unemployed                  15
Maternity leave              2
Name: NAME_INCOME_TYPE, dtype: int64 

number of missing values in this column is 0.0 
Secondary / secondary special    200125
Higher education                  65321
Incomplete higher                  9032
Lower secondary                    3608
Academic degree                     146
Name: NAME_EDUCATION_TYPE, dtype: int64 

number of missing values in this column is 0.0 
Married                 178711
Single / not married     39709
Civil marriage           26981
Separated                17846
Widow                    14985
Name: NAME_FAMILY_STATUS, dtype: int64 

number of missing values in this column is 0.0 
House / apartment      247389
With parents            12981
Municipal apartment     10237
Rented apartment         4290
Office apartment         2341
Co-op apartment           994
Name: NAME_HOUSING_TYPE, dtype: int64 

missing value in column REGION_POPULATION_RELATIVE is 0.0 

missing value in column DAYS_BIRTH is 0.0 

missing value in column DAYS_EMPLOYED is 0.0 

missing value in column DAYS_REGISTRATION is 0.0 

missing value in column DAYS_ID_PUBLISH is 0.0 

missing value in column OWN_CAR_AGE is 0.6605279047701199 

missing value in column FLAG_MOBIL is 0.0 

missing value in column FLAG_EMP_PHONE is 0.0 

missing value in column FLAG_WORK_PHONE is 0.0 

missing value in column FLAG_CONT_MOBILE is 0.0 

missing value in column FLAG_PHONE is 0.0 

missing value in column FLAG_EMAIL is 0.0 

number of missing values in this column is 0.31915811265418786 
NaN                      88800
Laborers                 50131
Sales staff              28534
Core staff               24374
Managers                 18739
Drivers                  17248
High skill tech staff    10008
Accountants               8607
Medicine staff            7740
Security staff            6178
Cooking staff             5437
Cleaning staff            4235
Private service staff     2359
Low-skill Laborers        1930
Waiters/barmen staff      1170
Secretaries               1169
Realty agents              675
HR staff                   478
IT staff                   420
Name: OCCUPATION_TYPE, dtype: int64 

missing value in column CNT_FAM_MEMBERS is 0.0 

missing value in column REGION_RATING_CLIENT is 0.0 

missing value in column REGION_RATING_CLIENT_W_CITY is 0.0 

number of missing values in this column is 0.0 
TUESDAY      49110
WEDNESDAY    47035
MONDAY       45954
THURSDAY     45815
FRIDAY       45383
SATURDAY     30495
SUNDAY       14440
Name: WEEKDAY_APPR_PROCESS_START, dtype: int64 

missing value in column HOUR_APPR_PROCESS_START is 0.0 

missing value in column REG_REGION_NOT_LIVE_REGION is 0.0 

missing value in column REG_REGION_NOT_WORK_REGION is 0.0 

missing value in column LIVE_REGION_NOT_WORK_REGION is 0.0 

missing value in column REG_CITY_NOT_LIVE_CITY is 0.0 

missing value in column REG_CITY_NOT_WORK_CITY is 0.0 

missing value in column LIVE_CITY_NOT_WORK_CITY is 0.0 

number of missing values in this column is 0.0 
Business Entity Type 3    60755
XNA                       52008
Self-employed             34518
Other                     15014
Medicine                  10118
Government                 9469
Business Entity Type 2     9363
School                     8067
Trade: type 7              7058
Kindergarten               6273
Construction               6084
Business Entity Type 1     5245
Transport: type 4          4925
Trade: type 3              3052
Industry: type 9           3022
Industry: type 3           2986
Security                   2955
Housing                    2715
Industry: type 11          2433
Military                   2379
Agriculture                2293
Bank                       2111
Police                     2086
Transport: type 2          2002
Postal                     1985
Security Ministries        1763
Restaurant                 1610
Trade: type 2              1599
Services                   1400
University                 1174
Industry: type 7           1164
Transport: type 3          1081
Industry: type 1            950
Hotel                       879
Electricity                 843
Industry: type 4            767
Trade: type 6               551
Industry: type 5            538
Telecom                     523
Insurance                   513
Emergency                   505
Industry: type 2            409
Advertising                 368
Realtor                     333
Culture                     330
Industry: type 12           323
Trade: type 1               300
Mobile                      268
Legal Services              259
Cleaning                    236
Transport: type 1           179
Industry: type 6             97
Industry: type 10            94
Religion                     79
Industry: type 13            59
Trade: type 4                54
Trade: type 5                46
Industry: type 8             22
Name: ORGANIZATION_TYPE, dtype: int64 

missing value in column EXT_SOURCE_1 is 0.5679253285028322 

missing value in column EXT_SOURCE_2 is 0.0021528796112596682 

missing value in column EXT_SOURCE_3 is 0.19644756893527704 

missing value in column APARTMENTS_AVG is 0.5118318525547025 

missing value in column BASEMENTAREA_AVG is 0.5892708243480261 

missing value in column YEARS_BEGINEXPLUATATION_AVG is 0.4921720003450358 

missing value in column YEARS_BUILD_AVG is 0.6684745104804624 

missing value in column COMMONAREA_AVG is 0.7017021766008223 

missing value in column ELEVATORS_AVG is 0.5372890249863423 

missing value in column ENTRANCES_AVG is 0.5079286350958911 

missing value in column FLOORSMAX_AVG is 0.5019875499583082 

missing value in column FLOORSMIN_AVG is 0.6818554300008626 

missing value in column LANDAREA_AVG is 0.5978176485810403 

missing value in column LIVINGAPARTMENTS_AVG is 0.6867039017798097 

missing value in column LIVINGAREA_AVG is 0.506113603036315 

missing value in column NONLIVINGAPARTMENTS_AVG is 0.697453923344547 

missing value in column NONLIVINGAREA_AVG is 0.5560179993674343 

missing value in column APARTMENTS_MODE is 0.5118318525547025 

missing value in column BASEMENTAREA_MODE is 0.5892708243480261 

missing value in column YEARS_BEGINEXPLUATATION_MODE is 0.4921720003450358 

missing value in column YEARS_BUILD_MODE is 0.6684745104804624 

missing value in column COMMONAREA_MODE is 0.7017021766008223 

missing value in column ELEVATORS_MODE is 0.5372890249863423 

missing value in column ENTRANCES_MODE is 0.5079286350958911 

missing value in column FLOORSMAX_MODE is 0.5019875499583082 

missing value in column FLOORSMIN_MODE is 0.6818554300008626 

missing value in column LANDAREA_MODE is 0.5978176485810403 

missing value in column LIVINGAPARTMENTS_MODE is 0.6867039017798097 

missing value in column LIVINGAREA_MODE is 0.506113603036315 

missing value in column NONLIVINGAPARTMENTS_MODE is 0.697453923344547 

missing value in column NONLIVINGAREA_MODE is 0.5560179993674343 

missing value in column APARTMENTS_MEDI is 0.5118318525547025 

missing value in column BASEMENTAREA_MEDI is 0.5892708243480261 

missing value in column YEARS_BEGINEXPLUATATION_MEDI is 0.4921720003450358 

missing value in column YEARS_BUILD_MEDI is 0.6684745104804624 

missing value in column COMMONAREA_MEDI is 0.7017021766008223 

missing value in column ELEVATORS_MEDI is 0.5372890249863423 

missing value in column ENTRANCES_MEDI is 0.5079286350958911 

missing value in column FLOORSMAX_MEDI is 0.5019875499583082 

missing value in column FLOORSMIN_MEDI is 0.6818554300008626 

missing value in column LANDAREA_MEDI is 0.5978176485810403 

missing value in column LIVINGAPARTMENTS_MEDI is 0.6867039017798097 

missing value in column LIVINGAREA_MEDI is 0.506113603036315 

missing value in column NONLIVINGAPARTMENTS_MEDI is 0.697453923344547 

missing value in column NONLIVINGAREA_MEDI is 0.5560179993674343 

number of missing values in this column is 0.6872322378446764 
NaN                      191210
reg oper account          66104
reg oper spec account     10854
not specified              5086
org spec account           4978
Name: FONDKAPREMONT_MODE, dtype: int64 

number of missing values in this column is 0.5061639207567785 
NaN                 140831
block of flats      134949
specific housing      1355
terraced house        1097
Name: HOUSETYPE_MODE, dtype: int64 

missing value in column TOTALAREA_MODE is 0.48705396934932 

number of missing values in this column is 0.5126728773110211 
NaN             142642
Panel            59340
Stone, brick     58088
Block             8266
Wooden            4851
Mixed             2048
Monolithic        1537
Others            1460
Name: WALLSMATERIAL_MODE, dtype: int64 

number of missing values in this column is 0.4784101037982691 
No     143014
NaN    133109
Yes      2109
Name: EMERGENCYSTATE_MODE, dtype: int64 

missing value in column OBS_30_CNT_SOCIAL_CIRCLE is 3.5941228902498635e-06 

missing value in column DEF_30_CNT_SOCIAL_CIRCLE is 3.5941228902498635e-06 

missing value in column OBS_60_CNT_SOCIAL_CIRCLE is 3.5941228902498635e-06 

missing value in column DEF_60_CNT_SOCIAL_CIRCLE is 3.5941228902498635e-06 

missing value in column DAYS_LAST_PHONE_CHANGE is 3.5941228902498635e-06 

missing value in column FLAG_DOCUMENT_2 is 0.0 

missing value in column FLAG_DOCUMENT_3 is 0.0 

missing value in column FLAG_DOCUMENT_4 is 0.0 

missing value in column FLAG_DOCUMENT_5 is 0.0 

missing value in column FLAG_DOCUMENT_6 is 0.0 

missing value in column FLAG_DOCUMENT_7 is 0.0 

missing value in column FLAG_DOCUMENT_8 is 0.0 

missing value in column FLAG_DOCUMENT_9 is 0.0 

missing value in column FLAG_DOCUMENT_10 is 0.0 

missing value in column FLAG_DOCUMENT_11 is 0.0 

missing value in column FLAG_DOCUMENT_12 is 0.0 

missing value in column FLAG_DOCUMENT_13 is 0.0 

missing value in column FLAG_DOCUMENT_14 is 0.0 

missing value in column FLAG_DOCUMENT_15 is 0.0 

missing value in column FLAG_DOCUMENT_16 is 0.0 

missing value in column FLAG_DOCUMENT_17 is 0.0 

missing value in column FLAG_DOCUMENT_18 is 0.0 

missing value in column FLAG_DOCUMENT_19 is 0.0 

missing value in column FLAG_DOCUMENT_20 is 0.0 

missing value in column FLAG_DOCUMENT_21 is 0.0 

missing value in column AMT_REQ_CREDIT_BUREAU_HOUR is 0.13384154231001466 

missing value in column AMT_REQ_CREDIT_BUREAU_DAY is 0.13384154231001466 

missing value in column AMT_REQ_CREDIT_BUREAU_WEEK is 0.13384154231001466 

missing value in column AMT_REQ_CREDIT_BUREAU_MON is 0.13384154231001466 

missing value in column AMT_REQ_CREDIT_BUREAU_QRT is 0.13384154231001466 

missing value in column AMT_REQ_CREDIT_BUREAU_YEAR is 0.13384154231001466 

Summary of missing data: 65 cols has missing value, in which Numeric is 59 cols and Category is 6 cols 
 
 NUMERIC: 
 {'AMT_ANNUITY': 12, 'OWN_CAR_AGE': 183780, 'EXT_SOURCE_1': 158015, 'EXT_SOURCE_2': 599, 'EXT_SOURCE_3': 54658, 'APARTMENTS_AVG': 142408, 'BASEMENTAREA_AVG': 163954, 'YEARS_BEGINEXPLUATATION_AVG': 136938, 'YEARS_BUILD_AVG': 185991, 'COMMONAREA_AVG': 195236, 'ELEVATORS_AVG': 149491, 'ENTRANCES_AVG': 141322, 'FLOORSMAX_AVG': 139669, 'FLOORSMIN_AVG': 189714, 'LANDAREA_AVG': 166332, 'LIVINGAPARTMENTS_AVG': 191063, 'LIVINGAREA_AVG': 140817, 'NONLIVINGAPARTMENTS_AVG': 194054, 'NONLIVINGAREA_AVG': 154702, 'APARTMENTS_MODE': 142408, 'BASEMENTAREA_MODE': 163954, 'YEARS_BEGINEXPLUATATION_MODE': 136938, 'YEARS_BUILD_MODE': 185991, 'COMMONAREA_MODE': 195236, 'ELEVATORS_MODE': 149491, 'ENTRANCES_MODE': 141322, 'FLOORSMAX_MODE': 139669, 'FLOORSMIN_MODE': 189714, 'LANDAREA_MODE': 166332, 'LIVINGAPARTMENTS_MODE': 191063, 'LIVINGAREA_MODE': 140817, 'NONLIVINGAPARTMENTS_MODE': 194054, 'NONLIVINGAREA_MODE': 154702, 'APARTMENTS_MEDI': 142408, 'BASEMENTAREA_MEDI': 163954, 'YEARS_BEGINEXPLUATATION_MEDI': 136938, 'YEARS_BUILD_MEDI': 185991, 'COMMONAREA_MEDI': 195236, 'ELEVATORS_MEDI': 149491, 'ENTRANCES_MEDI': 141322, 'FLOORSMAX_MEDI': 139669, 'FLOORSMIN_MEDI': 189714, 'LANDAREA_MEDI': 166332, 'LIVINGAPARTMENTS_MEDI': 191063, 'LIVINGAREA_MEDI': 140817, 'NONLIVINGAPARTMENTS_MEDI': 194054, 'NONLIVINGAREA_MEDI': 154702, 'TOTALAREA_MODE': 135514, 'OBS_30_CNT_SOCIAL_CIRCLE': 1, 'DEF_30_CNT_SOCIAL_CIRCLE': 1, 'OBS_60_CNT_SOCIAL_CIRCLE': 1, 'DEF_60_CNT_SOCIAL_CIRCLE': 1, 'DAYS_LAST_PHONE_CHANGE': 1, 'AMT_REQ_CREDIT_BUREAU_HOUR': 37239, 'AMT_REQ_CREDIT_BUREAU_DAY': 37239, 'AMT_REQ_CREDIT_BUREAU_WEEK': 37239, 'AMT_REQ_CREDIT_BUREAU_MON': 37239, 'AMT_REQ_CREDIT_BUREAU_QRT': 37239, 'AMT_REQ_CREDIT_BUREAU_YEAR': 37239}, 
 
 CATEGORY: 
 {'NAME_TYPE_SUITE': 1007, 'OCCUPATION_TYPE': 88800, 'FONDKAPREMONT_MODE': 191210, 'HOUSETYPE_MODE': 140831, 'WALLSMATERIAL_MODE': 142642, 'EMERGENCYSTATE_MODE': 133109}

Base on data description provided by source, as well as crosschecking on testset, together with above summary of missing value and data inconsistency, removing NA or imputation will process as following:

  • inspect on 1 data missing on _SOCIAL_CIRCLE and DAYS_LAST_PHONE_CHANGE, to remove if it is technical issue in data gathering by the source (not appear in testset), otherwise impute to appropriate value.
  • for normalized data: missing value will be replace by -1
  • for days-to-extract-time data: > 0 is inconsistent data. If x > 0 or missing, impute by 1
  • for continuous data of currency amount: impute by mean value.
  • for continuous data of other type: inspect data detail to consider mean or -1 value to be applied.
  • for category data: denote missing value group to "unknown"

1.2 processing on single data missing¶

In [11]:
#associated on 1 record, temporary to impute by -1, later when perform data analysis, will consider to update this into proper value
df['OBS_30_CNT_SOCIAL_CIRCLE'].fillna(-1,inplace = True)
df['DEF_30_CNT_SOCIAL_CIRCLE'].fillna(-1,inplace = True)
df['OBS_60_CNT_SOCIAL_CIRCLE'].fillna(-1,inplace = True)
df['DEF_60_CNT_SOCIAL_CIRCLE'].fillna(-1,inplace = True)
In [12]:
#technical issue, to drop this record
df.dropna(subset=['DAYS_LAST_PHONE_CHANGE'], inplace=True)

1.3 processing on numeric data: normalized and days¶

In [13]:
numeric_missing = []
cat_missing = []
    
for i in range(len(df.columns)):
    if df.iloc[:,i].dtypes == 'int64' or df.iloc[:,i].dtypes == 'float64':
        if df.iloc[:,i].isnull().sum() > 0:
            numeric_missing.append(df.columns[i])
    else:
        if df.iloc[:,i].isnull().sum() > 0:
            cat_missing.append(df.columns[i])
In [14]:
#impute numeric value: missing value
for col in numeric_missing:
    if col in data_desc[data_desc['Special'].isin(['normalized', 'normalized '])]['Row'].tolist():
        df[col].fillna(-1,inplace=True) 
    elif col in data_desc[data_desc['Special']=='time only relative to the application']['Row'].tolist():
        df[col].fillna(1,inplace = True)
In [15]:
#impute numeric value: inconsistent value
#days count to application time should be less than 0 (past data), thus impute to 1 to remark invalid data to same group of missing data
for col in numeric_missing:
    if col in data_desc[data_desc['Special']=='time only relative to the application']['Row'].tolist() and len(df.loc[df[col] > 0]) > 0:
        df[col] = df[col].apply(lambda x: x if x < 0 else 1)

1.4 processing on continuous data of currency amount¶

let's have a look on 12 missing record of AMT_ANNUITY:

In [16]:
df[df['AMT_ANNUITY'].isnull()]
Out[16]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
47531 155054 0 Cash loans M N N 0 180000.00 450000.00 NaN ... 0 0 0 0 0.00 0.00 0.00 0.00 1.00 1.00
50035 157917 0 Cash loans F N N 0 94500.00 450000.00 NaN ... 0 0 0 0 0.00 0.00 0.00 1.00 0.00 1.00
51594 159744 0 Cash loans F N N 0 202500.00 539100.00 NaN ... 0 0 0 0 0.00 0.00 0.00 1.00 0.00 1.00
55025 163757 0 Cash loans F N N 0 162000.00 296280.00 NaN ... 0 0 0 0 0.00 0.00 0.00 1.00 0.00 4.00
59934 169487 0 Cash loans M Y N 0 202500.00 360000.00 NaN ... 0 0 0 0 0.00 0.00 1.00 0.00 0.00 6.00
75873 187985 0 Cash loans M Y N 0 144000.00 219249.00 NaN ... 0 0 0 0 0.00 0.00 0.00 0.00 0.00 11.00
89343 203726 0 Cash loans F Y N 0 90000.00 157500.00 NaN ... 0 0 0 0 0.00 0.00 0.00 2.00 0.00 0.00
123872 243648 0 Cash loans F N Y 0 202500.00 929088.00 NaN ... 0 0 0 0 0.00 0.00 0.00 0.00 1.00 5.00
207186 340147 0 Cash loans M N N 0 171000.00 486000.00 NaN ... 0 0 0 0 0.00 0.00 1.00 1.00 0.00 2.00
227939 364022 0 Cash loans F N Y 0 315000.00 628069.50 NaN ... 0 0 0 0 0.00 0.00 1.00 0.00 0.00 2.00
239329 377174 0 Cash loans F N Y 0 157500.00 792000.00 NaN ... 0 0 0 0 0.00 0.00 0.00 0.00 1.00 1.00
241835 379997 0 Cash loans F N N 0 315000.00 1483231.50 NaN ... 0 0 0 0 NaN NaN NaN NaN NaN NaN

12 rows × 122 columns

as Amount of Annuity is mandatory data, this missing can be technical issue during data gathering by the source, to impute by mean value.

In [17]:
df['AMT_ANNUITY'].fillna(df['AMT_ANNUITY'].mean(),inplace = True)

1.5 processing on continuous data of other type¶

inspect data of OWN_CAR_AGE to find proper impute value

In [18]:
df['OWN_CAR_AGE'].describe()
Out[18]:
count   94451.00
mean       12.02
std        11.65
min         0.00
25%         5.00
50%         9.00
75%        15.00
max        91.00
Name: OWN_CAR_AGE, dtype: float64

OWN_CAR_AGE has large missing value, presenting age or the own car, the larger of value the older of car, therefore the missing value will be denoted -1 to distinguish with 0 as new car.

In [19]:
df['OWN_CAR_AGE'].fillna(-1,inplace = True)

on remaining numeric data of AMT_REQ_CREDIT_BUREAU_time, missing data means there is no information, which is different from requesting data and showing 0 early request. Therefore, missing data of these fields will be denoted as -1

In [20]:
df['AMT_REQ_CREDIT_BUREAU_HOUR'].fillna(-1,inplace = True)
df['AMT_REQ_CREDIT_BUREAU_DAY'].fillna(-1,inplace = True)
df['AMT_REQ_CREDIT_BUREAU_WEEK'].fillna(-1,inplace = True)
df['AMT_REQ_CREDIT_BUREAU_MON'].fillna(-1,inplace = True)
df['AMT_REQ_CREDIT_BUREAU_QRT'].fillna(-1,inplace = True)
df['AMT_REQ_CREDIT_BUREAU_YEAR'].fillna(-1,inplace = True)

1.6 processing on category data¶

In [21]:
for cat in cat_missing:
    df[cat].fillna('unknown', inplace = True)
In [22]:
for i in cat_missing:
    print(df[i].value_counts())
    print('\n')
Unaccompanied      224540
Family              36687
Spouse, partner     10431
Children             2965
Other_B              1592
unknown              1007
Other_A               766
Group of people       243
Name: NAME_TYPE_SUITE, dtype: int64


unknown                  88799
Laborers                 50131
Sales staff              28534
Core staff               24374
Managers                 18739
Drivers                  17248
High skill tech staff    10008
Accountants               8607
Medicine staff            7740
Security staff            6178
Cooking staff             5437
Cleaning staff            4235
Private service staff     2359
Low-skill Laborers        1930
Waiters/barmen staff      1170
Secretaries               1169
Realty agents              675
HR staff                   478
IT staff                   420
Name: OCCUPATION_TYPE, dtype: int64


unknown                  191209
reg oper account          66104
reg oper spec account     10854
not specified              5086
org spec account           4978
Name: FONDKAPREMONT_MODE, dtype: int64


unknown             140831
block of flats      134948
specific housing      1355
terraced house        1097
Name: HOUSETYPE_MODE, dtype: int64


unknown         142642
Panel            59339
Stone, brick     58088
Block             8266
Wooden            4851
Mixed             2048
Monolithic        1537
Others            1460
Name: WALLSMATERIAL_MODE, dtype: int64


No         143013
unknown    133109
Yes          2109
Name: EMERGENCYSTATE_MODE, dtype: int64


1.7 transform boolean value from category to numeric datatype.¶

In [23]:
#before transform:

boolean_list = ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'EMERGENCYSTATE_MODE']
for b in boolean_list:
    print(df[b].value_counts())
    print('\n')
F    182800
M     95431
Name: CODE_GENDER, dtype: int64


N    183775
Y     94456
Name: FLAG_OWN_CAR, dtype: int64


Y    190206
N     88025
Name: FLAG_OWN_REALTY, dtype: int64


No         143013
unknown    133109
Yes          2109
Name: EMERGENCYSTATE_MODE, dtype: int64


In [24]:
#transform:

df['CODE_GENDER'] = df['CODE_GENDER'].map({'F':1,'M':0})
df['FLAG_OWN_CAR'] = df['FLAG_OWN_CAR'].map({'Y':1,'N':0})
df['FLAG_OWN_REALTY'] = df['FLAG_OWN_REALTY'].map({'Y':1,'N':0})
df['EMERGENCYSTATE_MODE'] = df['EMERGENCYSTATE_MODE'].map({'Yes':1,'No':0, 'unknown': -1})
In [25]:
#after transform:

for b in boolean_list:
    print(df[b].value_counts())
    print('\n')
1    182800
0     95431
Name: CODE_GENDER, dtype: int64


0    183775
1     94456
Name: FLAG_OWN_CAR, dtype: int64


1    190206
0     88025
Name: FLAG_OWN_REALTY, dtype: int64


 0    143013
-1    133109
 1      2109
Name: EMERGENCYSTATE_MODE, dtype: int64


last look on missing data before moving to data analysis part¶

In [26]:
missing_inspection(df)
missing value in column SK_ID_CURR is 0.0 

missing value in column TARGET is 0.0 

number of missing values in this column is 0.0 
Cash loans    278231
Name: NAME_CONTRACT_TYPE, dtype: int64 

missing value in column CODE_GENDER is 0.0 

missing value in column FLAG_OWN_CAR is 0.0 

missing value in column FLAG_OWN_REALTY is 0.0 

missing value in column CNT_CHILDREN is 0.0 

missing value in column AMT_INCOME_TOTAL is 0.0 

missing value in column AMT_CREDIT is 0.0 

missing value in column AMT_ANNUITY is 0.0 

missing value in column AMT_GOODS_PRICE is 0.0 

number of missing values in this column is 0.0 
Unaccompanied      224540
Family              36687
Spouse, partner     10431
Children             2965
Other_B              1592
unknown              1007
Other_A               766
Group of people       243
Name: NAME_TYPE_SUITE, dtype: int64 

number of missing values in this column is 0.0 
Working                 142718
Commercial associate     63652
Pensioner                51993
State servant            19836
Student                     15
Unemployed                  15
Maternity leave              2
Name: NAME_INCOME_TYPE, dtype: int64 

number of missing values in this column is 0.0 
Secondary / secondary special    200125
Higher education                  65321
Incomplete higher                  9031
Lower secondary                    3608
Academic degree                     146
Name: NAME_EDUCATION_TYPE, dtype: int64 

number of missing values in this column is 0.0 
Married                 178710
Single / not married     39709
Civil marriage           26981
Separated                17846
Widow                    14985
Name: NAME_FAMILY_STATUS, dtype: int64 

number of missing values in this column is 0.0 
House / apartment      247388
With parents            12981
Municipal apartment     10237
Rented apartment         4290
Office apartment         2341
Co-op apartment           994
Name: NAME_HOUSING_TYPE, dtype: int64 

missing value in column REGION_POPULATION_RELATIVE is 0.0 

missing value in column DAYS_BIRTH is 0.0 

missing value in column DAYS_EMPLOYED is 0.0 

missing value in column DAYS_REGISTRATION is 0.0 

missing value in column DAYS_ID_PUBLISH is 0.0 

missing value in column OWN_CAR_AGE is 0.0 

missing value in column FLAG_MOBIL is 0.0 

missing value in column FLAG_EMP_PHONE is 0.0 

missing value in column FLAG_WORK_PHONE is 0.0 

missing value in column FLAG_CONT_MOBILE is 0.0 

missing value in column FLAG_PHONE is 0.0 

missing value in column FLAG_EMAIL is 0.0 

number of missing values in this column is 0.0 
unknown                  88799
Laborers                 50131
Sales staff              28534
Core staff               24374
Managers                 18739
Drivers                  17248
High skill tech staff    10008
Accountants               8607
Medicine staff            7740
Security staff            6178
Cooking staff             5437
Cleaning staff            4235
Private service staff     2359
Low-skill Laborers        1930
Waiters/barmen staff      1170
Secretaries               1169
Realty agents              675
HR staff                   478
IT staff                   420
Name: OCCUPATION_TYPE, dtype: int64 

missing value in column CNT_FAM_MEMBERS is 0.0 

missing value in column REGION_RATING_CLIENT is 0.0 

missing value in column REGION_RATING_CLIENT_W_CITY is 0.0 

number of missing values in this column is 0.0 
TUESDAY      49110
WEDNESDAY    47035
MONDAY       45954
THURSDAY     45815
FRIDAY       45382
SATURDAY     30495
SUNDAY       14440
Name: WEEKDAY_APPR_PROCESS_START, dtype: int64 

missing value in column HOUR_APPR_PROCESS_START is 0.0 

missing value in column REG_REGION_NOT_LIVE_REGION is 0.0 

missing value in column REG_REGION_NOT_WORK_REGION is 0.0 

missing value in column LIVE_REGION_NOT_WORK_REGION is 0.0 

missing value in column REG_CITY_NOT_LIVE_CITY is 0.0 

missing value in column REG_CITY_NOT_WORK_CITY is 0.0 

missing value in column LIVE_CITY_NOT_WORK_CITY is 0.0 

number of missing values in this column is 0.0 
Business Entity Type 3    60755
XNA                       52008
Self-employed             34518
Other                     15014
Medicine                  10118
Government                 9469
Business Entity Type 2     9363
School                     8067
Trade: type 7              7058
Kindergarten               6273
Construction               6084
Business Entity Type 1     5245
Transport: type 4          4925
Trade: type 3              3051
Industry: type 9           3022
Industry: type 3           2986
Security                   2955
Housing                    2715
Industry: type 11          2433
Military                   2379
Agriculture                2293
Bank                       2111
Police                     2086
Transport: type 2          2002
Postal                     1985
Security Ministries        1763
Restaurant                 1610
Trade: type 2              1599
Services                   1400
University                 1174
Industry: type 7           1164
Transport: type 3          1081
Industry: type 1            950
Hotel                       879
Electricity                 843
Industry: type 4            767
Trade: type 6               551
Industry: type 5            538
Telecom                     523
Insurance                   513
Emergency                   505
Industry: type 2            409
Advertising                 368
Realtor                     333
Culture                     330
Industry: type 12           323
Trade: type 1               300
Mobile                      268
Legal Services              259
Cleaning                    236
Transport: type 1           179
Industry: type 6             97
Industry: type 10            94
Religion                     79
Industry: type 13            59
Trade: type 4                54
Trade: type 5                46
Industry: type 8             22
Name: ORGANIZATION_TYPE, dtype: int64 

missing value in column EXT_SOURCE_1 is 0.0 

missing value in column EXT_SOURCE_2 is 0.0 

missing value in column EXT_SOURCE_3 is 0.0 

missing value in column APARTMENTS_AVG is 0.0 

missing value in column BASEMENTAREA_AVG is 0.0 

missing value in column YEARS_BEGINEXPLUATATION_AVG is 0.0 

missing value in column YEARS_BUILD_AVG is 0.0 

missing value in column COMMONAREA_AVG is 0.0 

missing value in column ELEVATORS_AVG is 0.0 

missing value in column ENTRANCES_AVG is 0.0 

missing value in column FLOORSMAX_AVG is 0.0 

missing value in column FLOORSMIN_AVG is 0.0 

missing value in column LANDAREA_AVG is 0.0 

missing value in column LIVINGAPARTMENTS_AVG is 0.0 

missing value in column LIVINGAREA_AVG is 0.0 

missing value in column NONLIVINGAPARTMENTS_AVG is 0.0 

missing value in column NONLIVINGAREA_AVG is 0.0 

missing value in column APARTMENTS_MODE is 0.0 

missing value in column BASEMENTAREA_MODE is 0.0 

missing value in column YEARS_BEGINEXPLUATATION_MODE is 0.0 

missing value in column YEARS_BUILD_MODE is 0.0 

missing value in column COMMONAREA_MODE is 0.0 

missing value in column ELEVATORS_MODE is 0.0 

missing value in column ENTRANCES_MODE is 0.0 

missing value in column FLOORSMAX_MODE is 0.0 

missing value in column FLOORSMIN_MODE is 0.0 

missing value in column LANDAREA_MODE is 0.0 

missing value in column LIVINGAPARTMENTS_MODE is 0.0 

missing value in column LIVINGAREA_MODE is 0.0 

missing value in column NONLIVINGAPARTMENTS_MODE is 0.0 

missing value in column NONLIVINGAREA_MODE is 0.0 

missing value in column APARTMENTS_MEDI is 0.0 

missing value in column BASEMENTAREA_MEDI is 0.0 

missing value in column YEARS_BEGINEXPLUATATION_MEDI is 0.0 

missing value in column YEARS_BUILD_MEDI is 0.0 

missing value in column COMMONAREA_MEDI is 0.0 

missing value in column ELEVATORS_MEDI is 0.0 

missing value in column ENTRANCES_MEDI is 0.0 

missing value in column FLOORSMAX_MEDI is 0.0 

missing value in column FLOORSMIN_MEDI is 0.0 

missing value in column LANDAREA_MEDI is 0.0 

missing value in column LIVINGAPARTMENTS_MEDI is 0.0 

missing value in column LIVINGAREA_MEDI is 0.0 

missing value in column NONLIVINGAPARTMENTS_MEDI is 0.0 

missing value in column NONLIVINGAREA_MEDI is 0.0 

number of missing values in this column is 0.0 
unknown                  191209
reg oper account          66104
reg oper spec account     10854
not specified              5086
org spec account           4978
Name: FONDKAPREMONT_MODE, dtype: int64 

number of missing values in this column is 0.0 
unknown             140831
block of flats      134948
specific housing      1355
terraced house        1097
Name: HOUSETYPE_MODE, dtype: int64 

missing value in column TOTALAREA_MODE is 0.0 

number of missing values in this column is 0.0 
unknown         142642
Panel            59339
Stone, brick     58088
Block             8266
Wooden            4851
Mixed             2048
Monolithic        1537
Others            1460
Name: WALLSMATERIAL_MODE, dtype: int64 

missing value in column EMERGENCYSTATE_MODE is 0.0 

missing value in column OBS_30_CNT_SOCIAL_CIRCLE is 0.0 

missing value in column DEF_30_CNT_SOCIAL_CIRCLE is 0.0 

missing value in column OBS_60_CNT_SOCIAL_CIRCLE is 0.0 

missing value in column DEF_60_CNT_SOCIAL_CIRCLE is 0.0 

missing value in column DAYS_LAST_PHONE_CHANGE is 0.0 

missing value in column FLAG_DOCUMENT_2 is 0.0 

missing value in column FLAG_DOCUMENT_3 is 0.0 

missing value in column FLAG_DOCUMENT_4 is 0.0 

missing value in column FLAG_DOCUMENT_5 is 0.0 

missing value in column FLAG_DOCUMENT_6 is 0.0 

missing value in column FLAG_DOCUMENT_7 is 0.0 

missing value in column FLAG_DOCUMENT_8 is 0.0 

missing value in column FLAG_DOCUMENT_9 is 0.0 

missing value in column FLAG_DOCUMENT_10 is 0.0 

missing value in column FLAG_DOCUMENT_11 is 0.0 

missing value in column FLAG_DOCUMENT_12 is 0.0 

missing value in column FLAG_DOCUMENT_13 is 0.0 

missing value in column FLAG_DOCUMENT_14 is 0.0 

missing value in column FLAG_DOCUMENT_15 is 0.0 

missing value in column FLAG_DOCUMENT_16 is 0.0 

missing value in column FLAG_DOCUMENT_17 is 0.0 

missing value in column FLAG_DOCUMENT_18 is 0.0 

missing value in column FLAG_DOCUMENT_19 is 0.0 

missing value in column FLAG_DOCUMENT_20 is 0.0 

missing value in column FLAG_DOCUMENT_21 is 0.0 

missing value in column AMT_REQ_CREDIT_BUREAU_HOUR is 0.0 

missing value in column AMT_REQ_CREDIT_BUREAU_DAY is 0.0 

missing value in column AMT_REQ_CREDIT_BUREAU_WEEK is 0.0 

missing value in column AMT_REQ_CREDIT_BUREAU_MON is 0.0 

missing value in column AMT_REQ_CREDIT_BUREAU_QRT is 0.0 

missing value in column AMT_REQ_CREDIT_BUREAU_YEAR is 0.0 

Summary of missing data: 0 cols has missing value, in which Numeric is 0 cols and Category is 0 cols 
 
 NUMERIC: 
 {}, 
 
 CATEGORY: 
 {}

all clean, now let's move to more interesting part: Data statistic and visualization. After this step, single data will be well understood and proper preparation for data transformation.

2. EXPLORATORY DATA ANALYSIS (EDA)¶

In this section, we will explore on statistical aspects of data, generally cover below step:

1. Data Visualization;
2. Summary Statistics; 
3. Correlation Analysis; 
4. Distribution Analysis.

Beforehand, let's build some function that will be used repeatedly:

In [28]:
#create neccessary plot for data visualization
# mainly use bar plot, histplot and plot_bar_target (when need to explore default rate association)

def plot_bar(df,column):
    plt.figure(figsize=(10,6))
    ax = sns.countplot(data=df, x=df[column],\
          palette=[sns.color_palette("pastel")[0]],order=df[column].value_counts().index)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
    plt.title('distribution of ' + column)
    plt.show() 
    plt.close()
    
def plot_hist(df, column):
    plt.figure(figsize=(10, 6))  # Set the figure size
    
    # Plot the histogram using Seaborn
    sns.histplot(df[column], color="skyblue", bins=30)
    
    # Customize the plot
    plt.title('Distribution of ' + column)  # Add title
    plt.xlabel(column)  # Add x-axis label
    plt.ylabel('Frequency')  # Add y-axis label
    plt.grid(axis='y', linestyle='--', alpha=0.7)  # Add grid for y-axis
    
    plt.show() 
    plt.close()
     
def plot_bar_target(df, column):
    df_data = df.groupby(column)['TARGET'].value_counts(normalize=True).unstack() * 100

    # Reset index to convert 'column' from index to column
    df_data = df_data.reset_index()

    # Create a bar plot using Seaborn
    fig, ax = plt.subplots(figsize=(10, 6)) 
    sns.barplot(x=column, y=0, color='skyblue', label='good', data=df_data)
    sns.barplot(x=column, y=1, color='lightcoral', label='bad', data=df_data)

    # Customize the plot
    plt.xlabel(column)
    plt.ylabel('%')
    plt.xticks(rotation=90)
    plt.title(column + ' by default rate')
    plt.legend(title='label')

    # Annotate the bars with percentages
    for p in ax.patches:
        height = p.get_height()
        ax.annotate(f'{height:.1f}%', (p.get_x() + p.get_width() / 2., height),
                    ha='center', va='bottom', fontsize=10, color='black')

    plt.tight_layout()
    plt.show()
    plt.close()

cat_list = []
num_list = []
for i in df.columns.to_list():
    if (df[i].dtypes == 'int64' or df[i].dtypes == 'float64') and len(df[i].value_counts()) >= 20:
        num_list.append(i)
    else:
         cat_list.append(i)

2.1 Data Visualization & Summary Statistics¶

firstly let's have a look on category data

In [29]:
for i in cat_list[2:len(cat_list)]:
    plot_bar(df,i)

We can observe certain group has higher proportion in this dataset, let's have a look on how they associated with default rate

In [30]:
for i in cat_list[2:len(cat_list)]:
    plot_bar_target(df,i)

from above visualization, we can detect some following highights:

  1. ORGANIZATION_TYPE has too many categories, will need to re-grouping into < 20 categories.
  2. INCOME, EDUCATION, OCCUPATION and CHILDREN shows most differentiate of default rate amonsgt the categories.
  3. INCOME_TYPE suggests most vulnerable group (high default rate) is on UNEMPLOYED & MATERNITY LEAVE. However, these groups account small proportion in training population.
  4. EMPLOYED_PHONE presenting show higher default rate.
  5. LIVING or WORKING address different than PERMANENT address show higher default rate. These groups account small proportion in training population.
  6. for FLAG_DOCUMENT_x, sharing similar character as providing optional documents, consider to create new feature combining these data following logic: FLAG_EXTRA_DOCUMENT for FLAG_DOCUMENT_x = 1 then FLAG_EXTRA_DOCUMENT = 1 else 0.

update ORGANIZATION_TYPE & FLAG_DOCUMENT_x¶

In [31]:
#ORGANIZATION regrouping into smaller group at df['ORGANIZATION_TYPE_TRANSF']

def map_category_to_group(category):
    if category in ['Business Entity Type 1',
                    'Business Entity Type 2',
                    'Business Entity Type 3']:
        return 'Business Entity'
    
    elif category in ['Industry: type 1',
             'Industry: type 10',
             'Industry: type 11',
             'Industry: type 12',
             'Industry: type 13',
             'Industry: type 2',
             'Industry: type 3',
             'Industry: type 4',
             'Industry: type 5',
             'Industry: type 6',
             'Industry: type 7',
             'Industry: type 8',
             'Industry: type 9']:
        return 'Industry'
    
    elif category in ['Trade: type 1',
                 'Trade: type 2',
                 'Trade: type 3',
                 'Trade: type 4',
                 'Trade: type 5',
                 'Trade: type 6',
                 'Trade: type 7']:
        return 'Trade'
    
    elif category in ['Transport: type 1',
             'Transport: type 2',
             'Transport: type 3',
             'Transport: type 4']:
        return 'Transport'
    
    elif category in ['Bank','Insurance' ]:
        return 'Financial Service'
    
    elif category in ['Military','Police','Security Ministries', 'Government' ]:
        return 'Gov Officer'
    
    elif category in ['Hotel','Restaurant','Services', 'Cleaning']:
        return 'Hospitality Industry'
    
    elif category in ['Housing','Realtor']:
        return 'Real Estate'
    
    elif category in ['Telecom','Mobile', 'Postal']:
        return 'Telcomunicate'
    else:
        return category

df['ORGANIZATION_TYPE_TRANSF'] = df['ORGANIZATION_TYPE'].apply(map_category_to_group)
In [32]:
#grouping DOCUMENT_TYPE_x into FLAG_EXTRA_DOCUMENT
flag_doc = [col for col in df.columns if 'FLAG_DOCUMENT_' in col] #update documents list to look on default rate
flag_doc.remove('FLAG_DOCUMENT_3') #most of customer presenting this document, therefore we don't want to group it up.

#if customer provide one of document (value = 1), to classify value 1 into FLAG_DOCUMENT_EXTRA
df['FLAG_DOCUMENT_EXTRA'] = df[flag_doc].max(axis=1)
In [33]:
plot_bar(df, 'FLAG_DOCUMENT_EXTRA')
In [34]:
plot_bar_target(df, 'FLAG_DOCUMENT_EXTRA')

moving to numeric data, let's look on distribution of this group¶

In [35]:
for i in num_list[1:len(num_list)]:
    plot_hist(df, i)

we can observe following highlights with respective strategy:

  1. anomaly data on DAYS_EMPLOYED, to inspect and update to invalid group if needed.
  2. anomaly data on OWN_CAR_AGE after 60, to inspect and consider to group into bins.
  3. normalize data to group into quartile bins.
  4. remaining data to group into 10 bins.

After these refining steps, will have a look on association of grouping data vs. default rate.

inspect DAYS_EMPLOYED & OWN_CAR_AGE¶

In [36]:
df[df['DAYS_EMPLOYED'] > 0]['DAYS_EMPLOYED'].describe()
Out[36]:
count    52008.00
mean    365243.00
std          0.00
min     365243.00
25%     365243.00
50%     365243.00
75%     365243.00
max     365243.00
Name: DAYS_EMPLOYED, dtype: float64

As we observe there is only 1 value repeated on 52008 records, to update it to invalid group, denote by value 1.

In [37]:
df['DAYS_EMPLOYED'] = df['DAYS_EMPLOYED'].apply(lambda x: 1 if x > 0 else x)

Next, let's have a look on OWN_CAR_AGE

In [38]:
df[df['OWN_CAR_AGE'] > 60]['OWN_CAR_AGE'].describe()
Out[38]:
count   2766.00
mean      64.30
std        0.69
min       63.00
25%       64.00
50%       64.00
75%       65.00
max       91.00
Name: OWN_CAR_AGE, dtype: float64

We have 2766 records on different values, therefore let's group them into 1 group.

In [39]:
bins, bin_edges = pd.qcut(df['OWN_CAR_AGE'], q=10, duplicates='drop',retbins=True)
bin_edges = [-1,0,4,8,14,60,float('+inf')]
df['OWN_CAR_AGE_BIN'] = pd.cut(df['OWN_CAR_AGE'],bins = bin_edges, right=False)
In [40]:
plot_bar(df,'OWN_CAR_AGE_BIN')
In [41]:
plot_bar_target(df,'OWN_CAR_AGE_BIN')

from above plot, we can observe that customer has newer car, from 0 to 8 years, has lower default rate.

Group normalize data into quartile & invalid¶

In [42]:
#normalize data binning
for col in num_list[1:len(num_list)]:
    if col in data_desc[data_desc['Special'].isin(['normalized', 'normalized '])]['Row'].tolist():
        bin_edges = [-1,0,0.25,0.5,0.75,float('+inf')]
        new_col = col + '_BIN'
        df[new_col] = pd.cut(df[col],bins = bin_edges, right=False)


#days data binning
for col in num_list[1:len(num_list)]:
    if col in data_desc[data_desc['Special']=='time only relative to the application']['Row'].tolist() and len(df[df[col]==1]) > 0:
        bins, bin_edges = pd.qcut(df[col], q=10, duplicates='drop',retbins=True)
        bin_edges[0] = float('-inf')
        bin_edges[-2] = 1
        bin_edges[-1] = float('+inf')
        
        new_col = col + '_BIN'
        df[new_col] = pd.cut(df[col],bins = bin_edges, right=False)
        
    elif col in data_desc[data_desc['Special']=='time only relative to the application']['Row'].tolist() and len(df[df[col]==1]) == 0:
        bins, bin_edges = pd.qcut(df[col], q=10, duplicates='drop',retbins=True)
        bin_edges[0] = float('-inf')
        bin_edges[-1] = float('+inf')
        
        new_col = col + '_BIN'
        df[new_col] = pd.cut(df[col],bins = bin_edges, right=False)


#remaning continous data binning
for col in num_list[1:len(num_list)]:
    if col not in (data_desc[data_desc['Special'].isin(['normalized', 'normalized '])]['Row'].tolist() + data_desc[data_desc['Special']=='time only relative to the application']['Row'].tolist()):
        bins, bin_edges = pd.qcut(df[col], q=10, duplicates='drop', retbins=True)
        bin_edges[0] = float('-inf')
        bin_edges[-1] = float('+inf')

        new_col = col + '_BIN'
        df[new_col] = pd.cut(df[col],bins = bin_edges, right=False)

#exceptional binning on REGION_POPULATION_RELATIVE due to data skew:
bins, bin_edges = pd.qcut(df['REGION_POPULATION_RELATIVE'], q=5, duplicates='drop', retbins=True)
bin_edges[0] =  float('-inf')
df['REGION_POPULATION_RELATIVE_BIN'] = pd.cut(df['REGION_POPULATION_RELATIVE'],bins = bin_edges, right=False)
df['REGION_POPULATION_RELATIVE_BIN'].value_counts().sort_index(axis = 0)
        
bin_list = [col for col in df.columns if '_BIN' in col] #update bin list to look on default rate

Now let's have a look on the list of new bin to see association with default rate

In [43]:
for col in bin_list:
    plot_bar(df, col)
In [44]:
for col in bin_list:
    plot_bar_target(df,col)

base on above plot, we can observe these highlights:

  1. EXT_SOURCE_x show differentiate of default rate amoungst bins;
  2. AMT_GOODS_PRICE shows differentiate of default rate amongst bins

Now we finish on data analysis, let's use these groups and update to calculate IV and WoE in order to select meaningful feature for model.

2.2. Correlation & Data distribution analysis¶

First, let's have a quick look on heatmap correlation to identify potential issue on existing feature list.

In [100]:
#once finalize the list after calculate WoE, to check correlation of selected features
corr_matrix = df[num_list[2:len(num_list)]].corr()

plt.figure(figsize=(15, 13))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', vmin=-1, vmax=1)
    
plt.xticks(fontsize=7)  
plt.yticks(fontsize=7)
plt.title('Correlation Heatmap', fontsize=10)
plt.show()
plt.close()

during data inspection as well as above roughcheck on correlation heatmap, we can notice that there are potential multicollinearity, especially on group of normalize data. To confirm this hypothesis, let's calculate VIF on list of single numeric variable.

In [99]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
np.seterr(divide='ignore', invalid='ignore') #ignore error

# Calculate VIF for each predictor variable, except group of DOCUMENT have been merged into FLAG_DOCUMENT_EXTRA
col = df.columns[3:95].tolist()
col.extend(['FLAG_DOCUMENT_EXTRA', 'FLAG_DOCUMENT_3'])
col.extend(df.columns[117:122].tolist()) 
vif_data = df[col].select_dtypes(include=np.number).copy()  # Ensure to work with numeric data only

for i in df.columns[117:122]:
    vif_data[i] = df[i].copy()

vif_data['Intercept'] = 1

vif = pd.DataFrame()
vif['VAR'] = vif_data.columns
vif['VIF'] = [variance_inflation_factor(vif_data.values, i) for i in range(vif_data.shape[1])]
len(vif[vif['VIF'] >= 10]) #check number of inflation feature 
Out[99]:
50

base on above result, we can see there are 50 numeric variables has VIF >= 10, confirm that there are multicollinearity in existing list features. To solve this issue, we will remove these high inflation variables from selected features. next step, let's have a quick look on correlation of remaining features.

In [101]:
corr_matrix = df[vif[vif['VIF'] < 10]['VAR'].tolist()[0:-1]].corr()

plt.figure(figsize=(15, 13))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', vmin=-1, vmax=1)
    
plt.xticks(fontsize=7)  
plt.yticks(fontsize=7)
plt.title('Correlation Heatmap', fontsize=10)
plt.show()
plt.close()

We can observe more reasonable level of correlation amongst features. Next step, let's calculate WoE and IV on categorical data, including _BIN of selected continuous data.

In [102]:
remove_cat = []
for i in cat_list:
    if len(df[i].value_counts()) <= 1:
        remove_cat.append(i)
refine_cat_list = [el for el in cat_list if el not in remove_cat]

selected_list=[]
for i in vif[vif['VIF'] < 10]['VAR'].tolist()[0:-1]:
    bin_var = i+'_BIN'
    if bin_var in bin_list:
        selected_list.append(bin_var)
    else:
        selected_list.append(i)

selected_list = selected_list + refine_cat_list
selected_list.append('ORGANIZATION_TYPE_TRANSF')

set_list = set()
for s in selected_list:
    set_list.add(s)

doc_list = set(['FLAG_DOCUMENT_11',
 'FLAG_DOCUMENT_12',
 'FLAG_DOCUMENT_13',
 'FLAG_DOCUMENT_14',
 'FLAG_DOCUMENT_15',
 'FLAG_DOCUMENT_16',
 'FLAG_DOCUMENT_17',
 'FLAG_DOCUMENT_18',
 'FLAG_DOCUMENT_19',
 'FLAG_DOCUMENT_2',
 'FLAG_DOCUMENT_20',
 'FLAG_DOCUMENT_21',
 'FLAG_DOCUMENT_4',
 'FLAG_DOCUMENT_5',
 'FLAG_DOCUMENT_6',
 'FLAG_DOCUMENT_8',
 'FLAG_DOCUMENT_9',
 'ORGANIZATION_TYPE'])

selected_list = [x for x in set_list if x not in doc_list]
selected_list.sort()
len(selected_list)
Out[102]:
54
In [62]:
# Function to calculate WoE and IV for a single feature
def calculate_iv(df, feature, target):
    # Create a summary DataFrame to calculate IV
    summary = df.groupby(feature)[target].agg(['count', 'sum'])
    summary.columns = ['#total', '#bad']
    summary['#good'] = summary['#total'] - summary['#bad']
    summary['%bad'] = summary['#bad'] / summary['#bad'].sum()
    summary['%good'] = summary['#good'] / summary['#good'].sum()
    summary['woe'] = np.log(summary['%good'] / summary['%bad'])
    summary['iv'] = (summary['%good'] - summary['%bad']) * summary['woe']
    iv = summary['iv'].sum()
    return iv, summary

# Calculate IV for each feature in the DataFrame
iv_values = {}
summaries = {}

for feature in selected_list:
    if feature != 'TARGET':  # Exclude the target variable
        iv, summary = calculate_iv(df, feature, 'TARGET')
        iv_values[feature] = iv
        summaries[feature] = summary

# Print IV values for each feature
for feature, iv in iv_values.items():
    print(f"IV for {feature}: {iv:.4f}")

# Optionally, print summaries for each feature
for feature, summary in summaries.items():
    print(f"\nSummary for {feature}:")
    print(summary)
    
IV for AMT_ANNUITY_BIN: 0.0285
IV for AMT_INCOME_TOTAL_BIN: 0.0091
IV for AMT_REQ_CREDIT_BUREAU_DAY: inf
IV for AMT_REQ_CREDIT_BUREAU_HOUR: inf
IV for AMT_REQ_CREDIT_BUREAU_MON_BIN: 0.0138
IV for AMT_REQ_CREDIT_BUREAU_QRT: inf
IV for AMT_REQ_CREDIT_BUREAU_WEEK: inf
IV for AMT_REQ_CREDIT_BUREAU_YEAR_BIN: 0.0174
IV for CNT_CHILDREN: inf
IV for CNT_FAM_MEMBERS: inf
IV for CODE_GENDER: 0.0425
IV for DAYS_BIRTH_BIN: 0.0905
IV for DAYS_EMPLOYED_BIN: 0.1157
IV for DAYS_ID_PUBLISH_BIN: 0.0379
IV for DAYS_REGISTRATION_BIN: 0.0281
IV for DEF_30_CNT_SOCIAL_CIRCLE: inf
IV for DEF_60_CNT_SOCIAL_CIRCLE: inf
IV for EMERGENCYSTATE_MODE: 0.0220
IV for EXT_SOURCE_1_BIN: 0.1360
IV for EXT_SOURCE_2_BIN: 0.2616
IV for EXT_SOURCE_3_BIN: 0.3044
IV for FLAG_CONT_MOBILE: 0.0000
IV for FLAG_DOCUMENT_3: 0.0163
IV for FLAG_DOCUMENT_EXTRA: 0.0221
IV for FLAG_EMAIL: 0.0000
IV for FLAG_EMP_PHONE: 0.0374
IV for FLAG_MOBIL: 0.0000
IV for FLAG_OWN_CAR: 0.0061
IV for FLAG_OWN_REALTY: 0.0003
IV for FLAG_PHONE: 0.0094
IV for FLAG_WORK_PHONE: 0.0087
IV for FONDKAPREMONT_MODE: 0.0117
IV for HOUR_APPR_PROCESS_START_BIN: 0.0089
IV for HOUSETYPE_MODE: 0.0207
IV for LIVE_CITY_NOT_WORK_CITY: 0.0137
IV for LIVE_REGION_NOT_WORK_REGION: 0.0002
IV for NAME_EDUCATION_TYPE: 0.0444
IV for NAME_FAMILY_STATUS: 0.0227
IV for NAME_HOUSING_TYPE: 0.0159
IV for NAME_INCOME_TYPE: inf
IV for NAME_TYPE_SUITE: 0.0026
IV for OCCUPATION_TYPE: 0.0820
IV for ORGANIZATION_TYPE_TRANSF: 0.0665
IV for OWN_CAR_AGE_BIN: 0.0188
IV for REGION_POPULATION_RELATIVE_BIN: 0.0146
IV for REGION_RATING_CLIENT: 0.0465
IV for REGION_RATING_CLIENT_W_CITY: 0.0492
IV for REG_CITY_NOT_LIVE_CITY: 0.0228
IV for REG_CITY_NOT_WORK_CITY: 0.0331
IV for REG_REGION_NOT_LIVE_REGION: 0.0005
IV for REG_REGION_NOT_WORK_REGION: 0.0009
IV for WALLSMATERIAL_MODE: 0.0254
IV for WEEKDAY_APPR_PROCESS_START: 0.0008

Summary for AMT_ANNUITY_BIN:
                    #total  #bad  #good  %bad  %good   woe   iv
AMT_ANNUITY_BIN                                                
[-inf, 12672.0)      27804  1908  25896  0.08   0.10  0.21 0.00
[12672.0, 16456.5)   27497  2129  25368  0.09   0.10  0.08 0.00
[16456.5, 19975.5)   28167  2606  25561  0.11   0.10 -0.11 0.00
[19975.5, 23089.5)   27751  2623  25128  0.11   0.10 -0.14 0.00
[23089.5, 26086.5)   27588  2322  25266  0.10   0.10 -0.01 0.00
[26086.5, 29299.5)   28116  2623  25493  0.11   0.10 -0.12 0.00
[29299.5, 32895.0)   27449  2811  24638  0.12   0.10 -0.23 0.01
[32895.0, 38236.5)   28208  2412  25796  0.10   0.10 -0.03 0.00
[38236.5, 46786.5)   27821  2165  25656  0.09   0.10  0.08 0.00
[46786.5, inf)       27830  1622  26208  0.07   0.10  0.39 0.01

Summary for AMT_INCOME_TOTAL_BIN:
                      #total  #bad  #good  %bad  %good   woe   iv
AMT_INCOME_TOTAL_BIN                                             
[-inf, 81000.0)        24252  1962  22290  0.08   0.09  0.03 0.00
[81000.0, 99000.0)     28612  2466  26146  0.11   0.10 -0.04 0.00
[99000.0, 112500.0)     8634   744   7890  0.03   0.03 -0.03 0.00
[112500.0, 135000.0)   39545  3563  35982  0.15   0.14 -0.08 0.00
[135000.0, 153000.0)   37561  3284  34277  0.14   0.13 -0.05 0.00
[153000.0, 166500.0)   27134  2427  24707  0.10   0.10 -0.08 0.00
[166500.0, 189000.0)   28774  2521  26253  0.11   0.10 -0.05 0.00
[189000.0, 225000.0)   21826  1807  20019  0.08   0.08  0.01 0.00
[225000.0, 270000.0)   27375  2116  25259  0.09   0.10  0.08 0.00
[270000.0, inf)        34518  2331  32187  0.10   0.13  0.23 0.01

Summary for AMT_REQ_CREDIT_BUREAU_DAY:
                           #total   #bad   #good  %bad  %good   woe   iv
AMT_REQ_CREDIT_BUREAU_DAY                                               
-1.00                       37238   3985   33253  0.17   0.13 -0.27 0.01
0.00                       239620  19102  220518  0.82   0.86  0.05 0.00
1.00                         1189    118    1071  0.01   0.00 -0.19 0.00
2.00                           94     11      83  0.00   0.00 -0.38 0.00
3.00                           45      2      43  0.00   0.00  0.67 0.00
4.00                           25      3      22  0.00   0.00 -0.40 0.00
5.00                            9      0       9  0.00   0.00   inf  inf
6.00                            8      0       8  0.00   0.00   inf  inf
8.00                            1      0       1  0.00   0.00   inf  inf
9.00                            2      0       2  0.00   0.00   inf  inf

Summary for AMT_REQ_CREDIT_BUREAU_HOUR:
                            #total   #bad   #good  %bad  %good   woe   iv
AMT_REQ_CREDIT_BUREAU_HOUR                                               
-1.00                        37238   3985   33253  0.17   0.13 -0.27 0.01
0.00                        239523  19114  220409  0.82   0.86  0.05 0.00
1.00                          1411    116    1295  0.00   0.01  0.02 0.00
2.00                            49      6      43  0.00   0.00 -0.43 0.00
3.00                             9      0       9  0.00   0.00   inf  inf
4.00                             1      0       1  0.00   0.00   inf  inf

Summary for AMT_REQ_CREDIT_BUREAU_MON_BIN:
                               #total   #bad   #good  %bad  %good   woe   iv
AMT_REQ_CREDIT_BUREAU_MON_BIN                                               
[-inf, 0.0)                     37238   3985   33253  0.17   0.13 -0.27 0.01
[0.0, 1.0)                     200509  16166  184343  0.70   0.72  0.04 0.00
[1.0, inf)                      40484   3070   37414  0.13   0.15  0.10 0.00

Summary for AMT_REQ_CREDIT_BUREAU_QRT:
                           #total   #bad   #good  %bad  %good   woe   iv
AMT_REQ_CREDIT_BUREAU_QRT                                               
-1.00                       37238   3985   33253  0.17   0.13 -0.27 0.01
0.00                       194166  15808  178358  0.68   0.70  0.03 0.00
1.00                        31452   2108   29344  0.09   0.12  0.24 0.01
2.00                        13215   1136   12079  0.05   0.05 -0.03 0.00
3.00                         1615    119    1496  0.01   0.01  0.14 0.00
4.00                          446     52     394  0.00   0.00 -0.37 0.00
5.00                           60      6      54  0.00   0.00 -0.20 0.00
6.00                           25      6      19  0.00   0.00 -1.24 0.00
7.00                            6      0       6  0.00   0.00   inf  inf
8.00                            6      0       6  0.00   0.00   inf  inf
19.00                           1      1       0  0.00   0.00  -inf  inf
261.00                          1      0       1  0.00   0.00   inf  inf

Summary for AMT_REQ_CREDIT_BUREAU_WEEK:
                            #total   #bad   #good  %bad  %good   woe   iv
AMT_REQ_CREDIT_BUREAU_WEEK                                               
-1.00                        37238   3985   33253  0.17   0.13 -0.27 0.01
0.00                        233047  18599  214448  0.80   0.84  0.05 0.00
1.00                          7636    609    7027  0.03   0.03  0.05 0.00
2.00                           186     19     167  0.00   0.00 -0.22 0.00
3.00                            54      3      51  0.00   0.00  0.44 0.00
4.00                            34      4      30  0.00   0.00 -0.38 0.00
5.00                            10      1       9  0.00   0.00 -0.20 0.00
6.00                            19      1      18  0.00   0.00  0.49 0.00
7.00                             2      0       2  0.00   0.00   inf  inf
8.00                             5      0       5  0.00   0.00   inf  inf

Summary for AMT_REQ_CREDIT_BUREAU_YEAR_BIN:
                                #total  #bad  #good  %bad  %good   woe   iv
AMT_REQ_CREDIT_BUREAU_YEAR_BIN                                             
[-inf, 0.0)                      37238  3985  33253  0.17   0.13 -0.27 0.01
[0.0, 1.0)                       62941  4686  58255  0.20   0.23  0.12 0.00
[1.0, 2.0)                       57505  4351  53154  0.19   0.21  0.11 0.00
[2.0, 3.0)                       45984  3835  42149  0.17   0.17  0.00 0.00
[3.0, 4.0)                       31084  2547  28537  0.11   0.11  0.02 0.00
[4.0, inf)                       43479  3817  39662  0.16   0.16 -0.06 0.00

Summary for CNT_CHILDREN:
              #total   #bad   #good  %bad  %good   woe   iv
CNT_CHILDREN                                               
0             196452  15596  180856  0.67   0.71  0.05 0.00
1              54054   5048   49006  0.22   0.19 -0.12 0.00
2              23893   2177   21716  0.09   0.09 -0.10 0.00
3               3331    336    2995  0.01   0.01 -0.21 0.00
4                382     48     334  0.00   0.00 -0.46 0.00
5                 79      7      72  0.00   0.00 -0.07 0.00
6                 20      6      14  0.00   0.00 -1.55 0.00
7                  7      0       7  0.00   0.00   inf  inf
8                  2      0       2  0.00   0.00   inf  inf
9                  2      2       0  0.00   0.00  -inf  inf
10                 2      0       2  0.00   0.00   inf  inf
11                 1      1       0  0.00   0.00  -inf  inf
12                 1      0       1  0.00   0.00   inf  inf
14                 3      0       3  0.00   0.00   inf  inf
19                 2      0       2  0.00   0.00   inf  inf

Summary for CNT_FAM_MEMBERS:
                 #total   #bad   #good  %bad  %good   woe   iv
CNT_FAM_MEMBERS                                               
1.00              60832   5261   55571  0.23   0.22 -0.04 0.00
2.00             145017  11320  133697  0.49   0.52  0.07 0.00
3.00              46699   4270   42429  0.18   0.17 -0.10 0.00
4.00              22086   2001   20085  0.09   0.08 -0.09 0.00
5.00               3118    306    2812  0.01   0.01 -0.18 0.00
6.00                365     48     317  0.00   0.00 -0.51 0.00
7.00                 76      6      70  0.00   0.00  0.06 0.00
8.00                 19      6      13  0.00   0.00 -1.62 0.00
9.00                  6      0       6  0.00   0.00   inf  inf
10.00                 3      1       2  0.00   0.00 -1.70 0.00
11.00                 1      1       0  0.00   0.00  -inf  inf
12.00                 2      0       2  0.00   0.00   inf  inf
13.00                 1      1       0  0.00   0.00  -inf  inf
14.00                 1      0       1  0.00   0.00   inf  inf
15.00                 1      0       1  0.00   0.00   inf  inf
16.00                 2      0       2  0.00   0.00   inf  inf
20.00                 2      0       2  0.00   0.00   inf  inf

Summary for CODE_GENDER:
             #total   #bad   #good  %bad  %good   woe   iv
CODE_GENDER                                               
0             95431  10094   85337  0.43   0.33 -0.26 0.03
1            182800  13127  169673  0.57   0.67  0.16 0.02

Summary for DAYS_BIRTH_BIN:
                      #total  #bad  #good  %bad  %good   woe   iv
DAYS_BIRTH_BIN                                                   
[-inf, -22265.0)       27813  1388  26425  0.06   0.10  0.55 0.02
[-22265.0, -20581.0)   27828  1533  26295  0.07   0.10  0.45 0.02
[-20581.0, -19019.0)   27825  1809  26016  0.08   0.10  0.27 0.01
[-19019.0, -17363.0)   27812  2041  25771  0.09   0.10  0.14 0.00
[-17363.0, -15874.0)   27820  2211  25609  0.10   0.10  0.05 0.00
[-15874.0, -14537.0)   27834  2287  25547  0.10   0.10  0.02 0.00
[-14537.0, -13263.0)   27829  2555  25274  0.11   0.10 -0.10 0.00
[-13263.0, -11819.0)   27822  2857  24965  0.12   0.10 -0.23 0.01
[-11819.0, -10403.0)   27810  3150  24660  0.14   0.10 -0.34 0.01
[-10403.0, inf)        27838  3390  24448  0.15   0.10 -0.42 0.02

Summary for DAYS_EMPLOYED_BIN:
                    #total  #bad  #good  %bad  %good   woe   iv
DAYS_EMPLOYED_BIN                                              
[-inf, -4909.0)      27816  1382  26434  0.06   0.10  0.55 0.02
[-4909.0, -3252.0)   27810  1750  26060  0.08   0.10  0.30 0.01
[-3252.0, -2381.0)   27832  2077  25755  0.09   0.10  0.12 0.00
[-2381.0, -1708.0)   27818  2368  25450  0.10   0.10 -0.02 0.00
[-1708.0, -1220.0)   27799  2812  24987  0.12   0.10 -0.21 0.00
[-1220.0, -819.0)    27815  3060  24755  0.13   0.10 -0.31 0.01
[-819.0, -445.0)     27805  3293  24512  0.14   0.10 -0.39 0.02
[-445.0, 1.0)        31528  3629  27899  0.16   0.11 -0.36 0.02
[1.0, inf)           52008  2850  49158  0.12   0.19  0.45 0.03

Summary for DAYS_ID_PUBLISH_BIN:
                     #total  #bad  #good  %bad  %good   woe   iv
DAYS_ID_PUBLISH_BIN                                             
[-inf, -4729.0)       27819  1725  26094  0.07   0.10  0.32 0.01
[-4729.0, -4439.0)    27759  1748  26011  0.08   0.10  0.30 0.01
[-4439.0, -4184.0)    27882  1895  25987  0.08   0.10  0.22 0.00
[-4184.0, -3909.0)    27782  2215  25567  0.10   0.10  0.05 0.00
[-3909.0, -3294.0)    27839  2256  25583  0.10   0.10  0.03 0.00
[-3294.0, -2696.0)    27827  2409  25418  0.10   0.10 -0.04 0.00
[-2696.0, -2091.0)    27838  2571  25267  0.11   0.10 -0.11 0.00
[-2091.0, -1424.0)    27810  2671  25139  0.12   0.10 -0.15 0.00
[-1424.0, -757.0)     27828  2803  25025  0.12   0.10 -0.21 0.00
[-757.0, inf)         27847  2928  24919  0.13   0.10 -0.25 0.01

Summary for DAYS_REGISTRATION_BIN:
                       #total  #bad  #good  %bad  %good   woe   iv
DAYS_REGISTRATION_BIN                                             
[-inf, -9980.0)         27812  1615  26197  0.07   0.10  0.39 0.01
[-9980.0, -8269.0)      27822  1885  25937  0.08   0.10  0.23 0.00
[-8269.0, -6804.0)      27820  2078  25742  0.09   0.10  0.12 0.00
[-6804.0, -5484.0)      27813  2273  25540  0.10   0.10  0.02 0.00
[-5484.0, -4522.0)      27841  2489  25352  0.11   0.10 -0.08 0.00
[-4522.0, -3558.0)      27821  2534  25287  0.11   0.10 -0.10 0.00
[-3558.0, -2557.0)      27806  2487  25319  0.11   0.10 -0.08 0.00
[-2557.0, -1504.0)      27825  2420  25405  0.10   0.10 -0.05 0.00
[-1504.0, -698.0)       27814  2622  25192  0.11   0.10 -0.13 0.00
[-698.0, inf)           27857  2818  25039  0.12   0.10 -0.21 0.00

Summary for DEF_30_CNT_SOCIAL_CIRCLE:
                          #total   #bad   #good  %bad  %good   woe   iv
DEF_30_CNT_SOCIAL_CIRCLE                                               
-1.00                          1      0       1  0.00   0.00   inf  inf
0.00                      246054  19775  226279  0.85   0.89  0.04 0.00
1.00                       25885   2686   23199  0.12   0.09 -0.24 0.01
2.00                        4892    577    4315  0.02   0.02 -0.38 0.00
3.00                        1101    142     959  0.01   0.00 -0.49 0.00
4.00                         232     32     200  0.00   0.00 -0.56 0.00
5.00                          54      8      46  0.00   0.00 -0.65 0.00
6.00                          10      1       9  0.00   0.00 -0.20 0.00
7.00                           1      0       1  0.00   0.00   inf  inf
8.00                           1      0       1  0.00   0.00   inf  inf

Summary for DEF_60_CNT_SOCIAL_CIRCLE:
                          #total   #bad   #good  %bad  %good   woe   iv
DEF_60_CNT_SOCIAL_CIRCLE                                               
-1.00                          1      0       1  0.00   0.00   inf  inf
0.00                      254692  20623  234069  0.89   0.92  0.03 0.00
1.00                       19908   2131   17777  0.09   0.07 -0.27 0.01
2.00                        2924    357    2567  0.02   0.01 -0.42 0.00
3.00                         561     93     468  0.00   0.00 -0.78 0.00
4.00                         122     14     108  0.00   0.00 -0.35 0.00
5.00                          19      3      16  0.00   0.00 -0.72 0.00
6.00                           3      0       3  0.00   0.00   inf  inf
7.00                           1      0       1  0.00   0.00   inf  inf

Summary for EMERGENCYSTATE_MODE:
                     #total   #bad   #good  %bad  %good   woe   iv
EMERGENCYSTATE_MODE                                               
-1                   133109  12655  120454  0.54   0.47 -0.14 0.01
 0                   143013  10360  132653  0.45   0.52  0.15 0.01
 1                     2109    206    1903  0.01   0.01 -0.17 0.00

Summary for EXT_SOURCE_1_BIN:
                  #total   #bad   #good  %bad  %good   woe   iv
EXT_SOURCE_1_BIN                                               
[-1.0, 0.0)       158014  13841  144173  0.60   0.57 -0.05 0.00
[0.0, 0.25)        16987   2825   14162  0.12   0.06 -0.78 0.05
[0.25, 0.5)        41786   3759   38027  0.16   0.15 -0.08 0.00
[0.5, 0.75)        43943   2275   41668  0.10   0.16  0.51 0.03
[0.75, inf)        17501    521   16980  0.02   0.07  1.09 0.05

Summary for EXT_SOURCE_2_BIN:
                  #total  #bad   #good  %bad  %good   woe   iv
EXT_SOURCE_2_BIN                                              
[-1.0, 0.0)          598    50     548  0.00   0.00 -0.00 0.00
[0.0, 0.25)        34298  6157   28141  0.27   0.11 -0.88 0.14
[0.25, 0.5)        71839  7398   64441  0.32   0.25 -0.23 0.02
[0.5, 0.75)       158629  9256  149373  0.40   0.59  0.38 0.07
[0.75, inf)        12867   360   12507  0.02   0.05  1.15 0.04

Summary for EXT_SOURCE_3_BIN:
                  #total  #bad  #good  %bad  %good   woe   iv
EXT_SOURCE_3_BIN                                             
[-1.0, 0.0)        54657  5276  49381  0.23   0.19 -0.16 0.01
[0.0, 0.25)        26627  5258  21369  0.23   0.08 -0.99 0.14
[0.25, 0.5)        71203  6933  64270  0.30   0.25 -0.17 0.01
[0.5, 0.75)       103401  4998  98403  0.22   0.39  0.58 0.10
[0.75, inf)        22343   756  21587  0.03   0.08  0.96 0.05

Summary for FLAG_CONT_MOBILE:
                  #total   #bad   #good  %bad  %good   woe   iv
FLAG_CONT_MOBILE                                               
0                    141     13     128  0.00   0.00 -0.11 0.00
1                 278090  23208  254882  1.00   1.00  0.00 0.00

Summary for FLAG_DOCUMENT_3:
                 #total   #bad   #good  %bad  %good   woe   iv
FLAG_DOCUMENT_3                                               
0                 61033   4013   57020  0.17   0.22  0.26 0.01
1                217198  19208  197990  0.83   0.78 -0.06 0.00

Summary for FLAG_DOCUMENT_EXTRA:
                     #total   #bad   #good  %bad  %good   woe   iv
FLAG_DOCUMENT_EXTRA                                               
0                    214210  19153  195057  0.82   0.76 -0.08 0.00
1                     64021   4068   59953  0.18   0.24  0.29 0.02

Summary for FLAG_EMAIL:
            #total   #bad   #good  %bad  %good   woe   iv
FLAG_EMAIL                                               
0           262224  21909  240315  0.94   0.94 -0.00 0.00
1            16007   1312   14695  0.06   0.06  0.02 0.00

Summary for FLAG_EMP_PHONE:
                #total   #bad   #good  %bad  %good   woe   iv
FLAG_EMP_PHONE                                               
0                52018   2851   49167  0.12   0.19  0.45 0.03
1               226213  20370  205843  0.88   0.81 -0.08 0.01

Summary for FLAG_MOBIL:
            #total   #bad   #good  %bad  %good  woe   iv
FLAG_MOBIL                                              
1           278231  23221  255010  1.00   1.00 0.00 0.00

Summary for FLAG_OWN_CAR:
              #total   #bad   #good  %bad  %good   woe   iv
FLAG_OWN_CAR                                               
0             183775  16114  167661  0.69   0.66 -0.05 0.00
1              94456   7107   87349  0.31   0.34  0.11 0.00

Summary for FLAG_OWN_REALTY:
                 #total   #bad   #good  %bad  %good   woe   iv
FLAG_OWN_REALTY                                               
0                 88025   7520   80505  0.32   0.32 -0.03 0.00
1                190206  15701  174505  0.68   0.68  0.01 0.00

Summary for FLAG_PHONE:
            #total   #bad   #good  %bad  %good   woe   iv
FLAG_PHONE                                               
0           199162  17532  181630  0.76   0.71 -0.06 0.00
1            79069   5689   73380  0.24   0.29  0.16 0.01

Summary for FLAG_WORK_PHONE:
                 #total   #bad   #good  %bad  %good   woe   iv
FLAG_WORK_PHONE                                               
0                221547  17668  203879  0.76   0.80  0.05 0.00
1                 56684   5553   51131  0.24   0.20 -0.18 0.01

Summary for FONDKAPREMONT_MODE:
                       #total   #bad   #good  %bad  %good   woe   iv
FONDKAPREMONT_MODE                                                  
not specified            5086    405    4681  0.02   0.02  0.05 0.00
org spec account         4978    311    4667  0.01   0.02  0.31 0.00
reg oper account        66104   4800   61304  0.21   0.24  0.15 0.01
reg oper spec account   10854    735   10119  0.03   0.04  0.23 0.00
unknown                191209  16970  174239  0.73   0.68 -0.07 0.00

Summary for HOUR_APPR_PROCESS_START_BIN:
                             #total  #bad  #good  %bad  %good   woe   iv
HOUR_APPR_PROCESS_START_BIN                                             
[-inf, 8.0)                   20585  2150  18435  0.09   0.07 -0.25 0.01
[8.0, 9.0)                    13918  1296  12622  0.06   0.05 -0.12 0.00
[9.0, 10.0)                   25213  2114  23099  0.09   0.09 -0.01 0.00
[10.0, 11.0)                  34526  2840  31686  0.12   0.12  0.02 0.00
[11.0, 12.0)                  33841  2823  31018  0.12   0.12  0.00 0.00
[12.0, 13.0)                  30772  2623  28149  0.11   0.11 -0.02 0.00
[13.0, 14.0)                  27864  2316  25548  0.10   0.10  0.00 0.00
[14.0, 15.0)                  24898  2041  22857  0.09   0.09  0.02 0.00
[15.0, 16.0)                  22201  1734  20467  0.07   0.08  0.07 0.00
[16.0, inf)                   44413  3284  41129  0.14   0.16  0.13 0.00

Summary for HOUSETYPE_MODE:
                  #total   #bad   #good  %bad  %good   woe   iv
HOUSETYPE_MODE                                                 
block of flats    134948   9745  125203  0.42   0.49  0.16 0.01
specific housing    1355    143    1212  0.01   0.00 -0.26 0.00
terraced house      1097     96    1001  0.00   0.00 -0.05 0.00
unknown           140831  13237  127594  0.57   0.50 -0.13 0.01

Summary for LIVE_CITY_NOT_WORK_CITY:
                         #total   #bad   #good  %bad  %good   woe   iv
LIVE_CITY_NOT_WORK_CITY                                               
0                        228100  18045  210055  0.78   0.82  0.06 0.00
1                         50131   5176   44955  0.22   0.18 -0.23 0.01

Summary for LIVE_REGION_NOT_WORK_REGION:
                             #total   #bad   #good  %bad  %good   woe   iv
LIVE_REGION_NOT_WORK_REGION                                               
0                            267060  22223  244837  0.96   0.96  0.00 0.00
1                             11171    998   10173  0.04   0.04 -0.07 0.00

Summary for NAME_EDUCATION_TYPE:
                               #total   #bad   #good  %bad  %good   woe   iv
NAME_EDUCATION_TYPE                                                         
Academic degree                   146      3     143  0.00   0.00  1.47 0.00
Higher education                65321   3694   61627  0.16   0.24  0.42 0.03
Incomplete higher                9031    802    8229  0.03   0.03 -0.07 0.00
Lower secondary                  3608    397    3211  0.02   0.01 -0.31 0.00
Secondary / secondary special  200125  18325  181800  0.79   0.71 -0.10 0.01

Summary for NAME_FAMILY_STATUS:
                      #total   #bad   #good  %bad  %good   woe   iv
NAME_FAMILY_STATUS                                                 
Civil marriage         26981   2764   24217  0.12   0.09 -0.23 0.01
Married               178710  13970  164740  0.60   0.65  0.07 0.00
Separated              17846   1520   16326  0.07   0.06 -0.02 0.00
Single / not married   39709   4071   35638  0.18   0.14 -0.23 0.01
Widow                  14985    896   14089  0.04   0.06  0.36 0.01

Summary for NAME_HOUSING_TYPE:
                     #total   #bad   #good  %bad  %good   woe   iv
NAME_HOUSING_TYPE                                                 
Co-op apartment         994     83     911  0.00   0.00 -0.00 0.00
House / apartment    247388  19941  227447  0.86   0.89  0.04 0.00
Municipal apartment   10237    909    9328  0.04   0.04 -0.07 0.00
Office apartment       2341    159    2182  0.01   0.01  0.22 0.00
Rented apartment       4290    545    3745  0.02   0.01 -0.47 0.00
With parents          12981   1584   11397  0.07   0.04 -0.42 0.01

Summary for NAME_INCOME_TYPE:
                      #total   #bad   #good  %bad  %good   woe   iv
NAME_INCOME_TYPE                                                   
Commercial associate   63652   4990   58662  0.21   0.23  0.07 0.00
Maternity leave            2      2       0  0.00   0.00  -inf  inf
Pensioner              51993   2842   49151  0.12   0.19  0.45 0.03
State servant          19836   1188   18648  0.05   0.07  0.36 0.01
Student                   15      0      15  0.00   0.00   inf  inf
Unemployed                15      8       7  0.00   0.00 -2.53 0.00
Working               142718  14191  128527  0.61   0.50 -0.19 0.02

Summary for NAME_TYPE_SUITE:
                 #total   #bad   #good  %bad  %good   woe   iv
NAME_TYPE_SUITE                                               
Children           2965    226    2739  0.01   0.01  0.10 0.00
Family            36687   2820   33867  0.12   0.13  0.09 0.00
Group of people     243     21     222  0.00   0.00 -0.04 0.00
Other_A             766     74     692  0.00   0.00 -0.16 0.00
Other_B            1592    155    1437  0.01   0.01 -0.17 0.00
Spouse, partner   10431    851    9580  0.04   0.04  0.02 0.00
Unaccompanied    224540  19026  205514  0.82   0.81 -0.02 0.00
unknown            1007     48     959  0.00   0.00  0.60 0.00

Summary for OCCUPATION_TYPE:
                       #total  #bad  #good  %bad  %good   woe   iv
OCCUPATION_TYPE                                                   
Accountants              8607   434   8173  0.02   0.03  0.54 0.01
Cleaning staff           4235   409   3826  0.02   0.02 -0.16 0.00
Cooking staff            5437   577   4860  0.02   0.02 -0.27 0.00
Core staff              24374  1609  22765  0.07   0.09  0.25 0.01
Drivers                 17248  2002  15246  0.09   0.06 -0.37 0.01
HR staff                  478    33    445  0.00   0.00  0.21 0.00
High skill tech staff   10008   643   9365  0.03   0.04  0.28 0.00
IT staff                  420    33    387  0.00   0.00  0.07 0.00
Laborers                50131  5489  44642  0.24   0.18 -0.30 0.02
Low-skill Laborers       1930   344   1586  0.01   0.01 -0.87 0.01
Managers                18739  1241  17498  0.05   0.07  0.25 0.00
Medicine staff           7740   530   7210  0.02   0.03  0.21 0.00
Private service staff    2359   158   2201  0.01   0.01  0.24 0.00
Realty agents             675    55    620  0.00   0.00  0.03 0.00
Sales staff             28534  2833  25701  0.12   0.10 -0.19 0.00
Secretaries              1169    84   1085  0.00   0.00  0.16 0.00
Security staff           6178   684   5494  0.03   0.02 -0.31 0.00
Waiters/barmen staff     1170   137   1033  0.01   0.00 -0.38 0.00
unknown                 88799  5926  82873  0.26   0.32  0.24 0.02

Summary for ORGANIZATION_TYPE_TRANSF:
                          #total  #bad  #good  %bad  %good   woe   iv
ORGANIZATION_TYPE_TRANSF                                             
Advertising                  368    33    335  0.00   0.00 -0.08 0.00
Agriculture                 2293   239   2054  0.01   0.01 -0.25 0.00
Business Entity            75363  7197  68166  0.31   0.27 -0.15 0.01
Construction                6084   734   5350  0.03   0.02 -0.41 0.00
Culture                      330    17    313  0.00   0.00  0.52 0.00
Electricity                  843    62    781  0.00   0.00  0.14 0.00
Emergency                    505    37    468  0.00   0.00  0.14 0.00
Financial Service           2624   144   2480  0.01   0.01  0.45 0.00
Gov Officer                15697  1017  14680  0.04   0.06  0.27 0.00
Hospitality Industry        4125   372   3753  0.02   0.01 -0.08 0.00
Industry                   12864  1151  11713  0.05   0.05 -0.08 0.00
Kindergarten                6273   462   5811  0.02   0.02  0.14 0.00
Legal Services               259    21    238  0.00   0.00  0.03 0.00
Medicine                   10118   689   9429  0.03   0.04  0.22 0.00
Other                      15014  1196  13818  0.05   0.05  0.05 0.00
Real Estate                 3048   258   2790  0.01   0.01 -0.02 0.00
Religion                      79     5     74  0.00   0.00  0.30 0.00
School                      8067   493   7574  0.02   0.03  0.34 0.00
Security                    2955   306   2649  0.01   0.01 -0.24 0.00
Self-employed              34518  3622  30896  0.16   0.12 -0.25 0.01
Telcomunicate               2776   232   2544  0.01   0.01 -0.00 0.00
Trade                      12659  1197  11462  0.05   0.04 -0.14 0.00
Transport                   8187   825   7362  0.04   0.03 -0.21 0.00
University                  1174    62   1112  0.00   0.00  0.49 0.00
XNA                        52008  2850  49158  0.12   0.19  0.45 0.03

Summary for OWN_CAR_AGE_BIN:
                 #total   #bad   #good  %bad  %good   woe   iv
OWN_CAR_AGE_BIN                                               
[-inf, 2.0)      190182  16545  173637  0.71   0.68 -0.05 0.00
[2.0, 8.0)        31609   1823   29786  0.08   0.12  0.40 0.02
[8.0, 14.0)       26158   2037   24121  0.09   0.09  0.08 0.00
[14.0, inf)       30282   2816   27466  0.12   0.11 -0.12 0.00

Summary for REGION_POPULATION_RELATIVE_BIN:
                                #total  #bad  #good  %bad  %good   woe   iv
REGION_POPULATION_RELATIVE_BIN                                             
[-inf, 0.00863)                  54170  4968  49202  0.22   0.20 -0.09 0.00
[0.00863, 0.0166)                54917  4501  50416  0.20   0.20  0.03 0.00
[0.0166, 0.0207)                 53752  5174  48578  0.23   0.20 -0.14 0.00
[0.0207, 0.0308)                 54416  4544  49872  0.20   0.20  0.01 0.00
[0.0308, 0.0725)                 53619  3714  49905  0.16   0.20  0.22 0.01

Summary for REGION_RATING_CLIENT:
                      #total   #bad   #good  %bad  %good   woe   iv
REGION_RATING_CLIENT                                               
1                      28408   1451   26957  0.06   0.11  0.53 0.02
2                     205826  16711  189115  0.72   0.74  0.03 0.00
3                      43997   5059   38938  0.22   0.15 -0.36 0.02

Summary for REGION_RATING_CLIENT_W_CITY:
                             #total   #bad   #good  %bad  %good   woe   iv
REGION_RATING_CLIENT_W_CITY                                               
1                             30190   1545   28645  0.07   0.11  0.52 0.02
2                            208032  16963  191069  0.73   0.75  0.03 0.00
3                             40009   4713   35296  0.20   0.14 -0.38 0.02

Summary for REG_CITY_NOT_LIVE_CITY:
                        #total   #bad   #good  %bad  %good   woe   iv
REG_CITY_NOT_LIVE_CITY                                               
0                       256810  20498  236312  0.88   0.93  0.05 0.00
1                        21421   2723   18698  0.12   0.07 -0.47 0.02

Summary for REG_CITY_NOT_WORK_CITY:
                        #total   #bad   #good  %bad  %good   woe   iv
REG_CITY_NOT_WORK_CITY                                               
0                       214282  16185  198097  0.70   0.78  0.11 0.01
1                        63949   7036   56913  0.30   0.22 -0.31 0.02

Summary for REG_REGION_NOT_LIVE_REGION:
                            #total   #bad   #good  %bad  %good   woe   iv
REG_REGION_NOT_LIVE_REGION                                               
0                           274222  22827  251395  0.98   0.99  0.00 0.00
1                             4009    394    3615  0.02   0.01 -0.18 0.00

Summary for REG_REGION_NOT_WORK_REGION:
                            #total   #bad   #good  %bad  %good   woe   iv
REG_REGION_NOT_WORK_REGION                                               
0                           264418  21927  242491  0.94   0.95  0.01 0.00
1                            13813   1294   12519  0.06   0.05 -0.13 0.00

Summary for WALLSMATERIAL_MODE:
                    #total   #bad   #good  %bad  %good   woe   iv
WALLSMATERIAL_MODE                                               
Block                 8266    613    7653  0.03   0.03  0.13 0.00
Mixed                 2048    162    1886  0.01   0.01  0.06 0.00
Monolithic            1537     80    1457  0.00   0.01  0.51 0.00
Others                1460    123    1337  0.01   0.01 -0.01 0.00
Panel                59339   3918   55421  0.17   0.22  0.25 0.01
Stone, brick         58088   4454   53634  0.19   0.21  0.09 0.00
Wooden                4851    489    4362  0.02   0.02 -0.21 0.00
unknown             142642  13382  129260  0.58   0.51 -0.13 0.01

Summary for WEEKDAY_APPR_PROCESS_START:
                            #total  #bad  #good  %bad  %good   woe   iv
WEEKDAY_APPR_PROCESS_START                                             
FRIDAY                       45382  3835  41547  0.17   0.16 -0.01 0.00
MONDAY                       45954  3675  42279  0.16   0.17  0.05 0.00
SATURDAY                     30495  2475  28020  0.11   0.11  0.03 0.00
SUNDAY                       14440  1200  13240  0.05   0.05  0.00 0.00
THURSDAY                     45815  3841  41974  0.17   0.16 -0.00 0.00
TUESDAY                      49110  4250  44860  0.18   0.18 -0.04 0.00
WEDNESDAY                    47035  3945  43090  0.17   0.17 -0.01 0.00

we observe some features has IV = inf, let's inspect in detail:

In [63]:
inf_iv = []
for i, v in iv_values.items():
    if v ==  float('inf'):
        inf_iv.append(i)
        
for i in inf_iv:
    print(pd.DataFrame(summaries[i]))
                           #total   #bad   #good  %bad  %good   woe   iv
AMT_REQ_CREDIT_BUREAU_DAY                                               
-1.00                       37238   3985   33253  0.17   0.13 -0.27 0.01
0.00                       239620  19102  220518  0.82   0.86  0.05 0.00
1.00                         1189    118    1071  0.01   0.00 -0.19 0.00
2.00                           94     11      83  0.00   0.00 -0.38 0.00
3.00                           45      2      43  0.00   0.00  0.67 0.00
4.00                           25      3      22  0.00   0.00 -0.40 0.00
5.00                            9      0       9  0.00   0.00   inf  inf
6.00                            8      0       8  0.00   0.00   inf  inf
8.00                            1      0       1  0.00   0.00   inf  inf
9.00                            2      0       2  0.00   0.00   inf  inf
                            #total   #bad   #good  %bad  %good   woe   iv
AMT_REQ_CREDIT_BUREAU_HOUR                                               
-1.00                        37238   3985   33253  0.17   0.13 -0.27 0.01
0.00                        239523  19114  220409  0.82   0.86  0.05 0.00
1.00                          1411    116    1295  0.00   0.01  0.02 0.00
2.00                            49      6      43  0.00   0.00 -0.43 0.00
3.00                             9      0       9  0.00   0.00   inf  inf
4.00                             1      0       1  0.00   0.00   inf  inf
                           #total   #bad   #good  %bad  %good   woe   iv
AMT_REQ_CREDIT_BUREAU_QRT                                               
-1.00                       37238   3985   33253  0.17   0.13 -0.27 0.01
0.00                       194166  15808  178358  0.68   0.70  0.03 0.00
1.00                        31452   2108   29344  0.09   0.12  0.24 0.01
2.00                        13215   1136   12079  0.05   0.05 -0.03 0.00
3.00                         1615    119    1496  0.01   0.01  0.14 0.00
4.00                          446     52     394  0.00   0.00 -0.37 0.00
5.00                           60      6      54  0.00   0.00 -0.20 0.00
6.00                           25      6      19  0.00   0.00 -1.24 0.00
7.00                            6      0       6  0.00   0.00   inf  inf
8.00                            6      0       6  0.00   0.00   inf  inf
19.00                           1      1       0  0.00   0.00  -inf  inf
261.00                          1      0       1  0.00   0.00   inf  inf
                            #total   #bad   #good  %bad  %good   woe   iv
AMT_REQ_CREDIT_BUREAU_WEEK                                               
-1.00                        37238   3985   33253  0.17   0.13 -0.27 0.01
0.00                        233047  18599  214448  0.80   0.84  0.05 0.00
1.00                          7636    609    7027  0.03   0.03  0.05 0.00
2.00                           186     19     167  0.00   0.00 -0.22 0.00
3.00                            54      3      51  0.00   0.00  0.44 0.00
4.00                            34      4      30  0.00   0.00 -0.38 0.00
5.00                            10      1       9  0.00   0.00 -0.20 0.00
6.00                            19      1      18  0.00   0.00  0.49 0.00
7.00                             2      0       2  0.00   0.00   inf  inf
8.00                             5      0       5  0.00   0.00   inf  inf
              #total   #bad   #good  %bad  %good   woe   iv
CNT_CHILDREN                                               
0             196452  15596  180856  0.67   0.71  0.05 0.00
1              54054   5048   49006  0.22   0.19 -0.12 0.00
2              23893   2177   21716  0.09   0.09 -0.10 0.00
3               3331    336    2995  0.01   0.01 -0.21 0.00
4                382     48     334  0.00   0.00 -0.46 0.00
5                 79      7      72  0.00   0.00 -0.07 0.00
6                 20      6      14  0.00   0.00 -1.55 0.00
7                  7      0       7  0.00   0.00   inf  inf
8                  2      0       2  0.00   0.00   inf  inf
9                  2      2       0  0.00   0.00  -inf  inf
10                 2      0       2  0.00   0.00   inf  inf
11                 1      1       0  0.00   0.00  -inf  inf
12                 1      0       1  0.00   0.00   inf  inf
14                 3      0       3  0.00   0.00   inf  inf
19                 2      0       2  0.00   0.00   inf  inf
                 #total   #bad   #good  %bad  %good   woe   iv
CNT_FAM_MEMBERS                                               
1.00              60832   5261   55571  0.23   0.22 -0.04 0.00
2.00             145017  11320  133697  0.49   0.52  0.07 0.00
3.00              46699   4270   42429  0.18   0.17 -0.10 0.00
4.00              22086   2001   20085  0.09   0.08 -0.09 0.00
5.00               3118    306    2812  0.01   0.01 -0.18 0.00
6.00                365     48     317  0.00   0.00 -0.51 0.00
7.00                 76      6      70  0.00   0.00  0.06 0.00
8.00                 19      6      13  0.00   0.00 -1.62 0.00
9.00                  6      0       6  0.00   0.00   inf  inf
10.00                 3      1       2  0.00   0.00 -1.70 0.00
11.00                 1      1       0  0.00   0.00  -inf  inf
12.00                 2      0       2  0.00   0.00   inf  inf
13.00                 1      1       0  0.00   0.00  -inf  inf
14.00                 1      0       1  0.00   0.00   inf  inf
15.00                 1      0       1  0.00   0.00   inf  inf
16.00                 2      0       2  0.00   0.00   inf  inf
20.00                 2      0       2  0.00   0.00   inf  inf
                          #total   #bad   #good  %bad  %good   woe   iv
DEF_30_CNT_SOCIAL_CIRCLE                                               
-1.00                          1      0       1  0.00   0.00   inf  inf
0.00                      246054  19775  226279  0.85   0.89  0.04 0.00
1.00                       25885   2686   23199  0.12   0.09 -0.24 0.01
2.00                        4892    577    4315  0.02   0.02 -0.38 0.00
3.00                        1101    142     959  0.01   0.00 -0.49 0.00
4.00                         232     32     200  0.00   0.00 -0.56 0.00
5.00                          54      8      46  0.00   0.00 -0.65 0.00
6.00                          10      1       9  0.00   0.00 -0.20 0.00
7.00                           1      0       1  0.00   0.00   inf  inf
8.00                           1      0       1  0.00   0.00   inf  inf
                          #total   #bad   #good  %bad  %good   woe   iv
DEF_60_CNT_SOCIAL_CIRCLE                                               
-1.00                          1      0       1  0.00   0.00   inf  inf
0.00                      254692  20623  234069  0.89   0.92  0.03 0.00
1.00                       19908   2131   17777  0.09   0.07 -0.27 0.01
2.00                        2924    357    2567  0.02   0.01 -0.42 0.00
3.00                         561     93     468  0.00   0.00 -0.78 0.00
4.00                         122     14     108  0.00   0.00 -0.35 0.00
5.00                          19      3      16  0.00   0.00 -0.72 0.00
6.00                           3      0       3  0.00   0.00   inf  inf
7.00                           1      0       1  0.00   0.00   inf  inf
                      #total   #bad   #good  %bad  %good   woe   iv
NAME_INCOME_TYPE                                                   
Commercial associate   63652   4990   58662  0.21   0.23  0.07 0.00
Maternity leave            2      2       0  0.00   0.00  -inf  inf
Pensioner              51993   2842   49151  0.12   0.19  0.45 0.03
State servant          19836   1188   18648  0.05   0.07  0.36 0.01
Student                   15      0      15  0.00   0.00   inf  inf
Unemployed                15      8       7  0.00   0.00 -2.53 0.00
Working               142718  14191  128527  0.61   0.50 -0.19 0.02

In this scope of project, since there are too many variables can impact to model robust, threshold for IV value will be set at 0.05 to only allow fair predictive features. For feature has IV < 0.05 will be remove from final_features before fitting data to model. From above result, we can address the inf value caused from small group contain 0 data of good/bad record. Statistically, we can assume distribution of good/bad in these cases with small value 0.00000001. By that, selected feature based on correcting IV to be adding NAME_INCOME_TYPE with IV 0.06

In [103]:
#selecting feature has IV >= 0.1:
final_feature = []
for feature, iv in iv_values.items():
    if iv >= 0.01 and iv != float('inf'):
        final_feature.append(feature)

#update inspecting feature has IV = inf as above analysis:
final_feature.append('NAME_INCOME_TYPE')

#adding target
final_feature.append('TARGET')

df_filter = df[final_feature]

print(len(df_filter.columns))
32

now with this final set of 31 features, let's proceed to model fitting and selection

3. BUILDING MODEL¶

In this project, for classification problem, Logistic Regression will be used. Dataset now will be transformed to fit in model.

In [66]:
X = pd.get_dummies(df_filter.drop('TARGET',axis = 1),drop_first=True)
y = df_filter['TARGET']

To assessing model prediction, dataset will be splitted into 3 parts: 50% training, 25% test and 25% validation. With 31 features in various scale, we will also use StandardScaler to standardize the scale of feature character.

3.1 base model¶

In [88]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score


X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.5, random_state=42) #train plit
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42) #test x validation split
X_train_scaled = StandardScaler().fit_transform(X_train)
X_test_scaled = StandardScaler().fit_transform(X_test)
X_val_scaled = StandardScaler().fit_transform(X_val)

model = LogisticRegression(random_state = 42, max_iter=1000) # Instantiate the model with balanced class_weight
model.fit(X_train_scaled,y_train) # Fit the model using training set

#have a look on current params
model.get_params()
Out[88]:
{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 1000,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': 42,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

3.2 assessing base model:¶

In [89]:
#TEST SET
y_pred_t = model.predict(X_test_scaled)
y_pred_probs_t = model.predict_proba(X_test_scaled)[:, 1]
auc_roc_t = roc_auc_score(y_test, y_pred_probs_t)
CX_test = confusion_matrix(y_test, y_pred_t)


#VAL SET
y_pred_v = model.predict(X_val_scaled)
y_pred_probs_v = model.predict_proba(X_val_scaled)[:, 1]
auc_roc_v = roc_auc_score(y_val, y_pred_probs_v)
CX_val = confusion_matrix(y_val, y_pred_v)


#fpr_tr, tpr_tr, _ = roc_curve(y_train,  y_pred_probs_tr)
fpr_t, tpr_t, _ = roc_curve(y_test,  y_pred_probs_t)
fpr_v, tpr_v, _ = roc_curve(y_val,  y_pred_probs_v)

#plt.plot(fpr_tr,tpr_tr, label='AUC train_set:'+str(round(auc_roc_tr,5)))
plt.plot(fpr_t,tpr_t, label='AUC test_set:'+str(round(auc_roc_t,5)))
plt.plot(fpr_v,tpr_v, label='AUC val_set:'+str(round(auc_roc_v,5)))

plt.axline((0,0), slope=1,fillstyle='full', c='green', linestyle='-.', linewidth=0.5)

plt.title('AUC ROC Curve')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend()
plt.show() 

We can see the model is fitting to fair AUC score above 0.7 with describe method above. This score is above the targeting benchmark of 0.68. However, let's also try to look on tuning param by GridSearchCV with standard fold = 5, to see if we can finetuning to any better performance.

3.3 best model¶

In this step, we will tune param using GridSearchCV, target to achieve best model

In [104]:
param_grid = {
    'C': [0.001, 0.01, 1],  # Regularization parameter
    'solver': ['lbfgs','newton-cg', 'newton-cholesky', 'liblinear']       # Solver type
}

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)

print("Best parameters:", grid_search.best_params_)

# Evaluate the model with best parameters on test data
best_model = grid_search.best_estimator_
y_pred_probs_b = best_model.predict_proba(X_test_scaled)[:, 1]
auc_roc_b = roc_auc_score(y_test, y_pred_probs_b)
print("AUC:", auc_roc_b)
Best parameters: {'C': 0.01, 'solver': 'liblinear'}
AUC: 0.7348482882976024
In [85]:
best_model.get_params()
Out[85]:
{'C': 0.01,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 1000,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': 42,
 'solver': 'liblinear',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

3.4 assessing best model¶

In [91]:
#evaluation on best_param

#TEST SET
y_pred_t = best_model.predict(X_test_scaled)
y_pred_probs_t = best_model.predict_proba(X_test_scaled)[:, 1]
auc_roc_t = roc_auc_score(y_test, y_pred_probs_t)
CX_test = confusion_matrix(y_test, y_pred_t)


#VAL SET
y_pred_v = best_model.predict(X_val_scaled)
y_pred_probs_v = best_model.predict_proba(X_val_scaled)[:, 1]
auc_roc_v = roc_auc_score(y_val, y_pred_probs_v)
CX_val = confusion_matrix(y_val, y_pred_v)


#fpr_tr, tpr_tr, _ = roc_curve(y_train,  y_pred_probs_tr)
fpr_t, tpr_t, _ = roc_curve(y_test,  y_pred_probs_t)
fpr_v, tpr_v, _ = roc_curve(y_val,  y_pred_probs_v)

#plt.plot(fpr_tr,tpr_tr, label='AUC train_set:'+str(round(auc_roc_tr,5)))
plt.plot(fpr_t,tpr_t, label='AUC test_set:'+str(round(auc_roc_t,5)))
plt.plot(fpr_v,tpr_v, label='AUC val_set:'+str(round(auc_roc_v,5)))

plt.axline((0,0), slope=1,fillstyle='full', c='green', linestyle='-.', linewidth=0.5)

plt.title('AUC ROC Curve')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend()
plt.show() 

4. MODEL EVALUATION¶

After param tuning, we observe no significant different between the base model using default value vs. tuning model using best_param. Therefore, we can interchangeably these 2 models.

This result likely happened when we did select proper set of feature through intensive analysis, careful feature engineering & feature scaling earlier, so that it contributes to learning progress initially, therefore the tuning param can help little on improving performance in Logistic Regression in binary class prediction.

To demonstrate the next analysis on predictive result using confusion matrix, as showing slightly improve comparing to base model, evaluation on result of best_model will be used.

In [83]:
cm_clf1 = CX_test
cm_clf2 = CX_val

# Define class labels
classes = ['good', 'bad']

# Plotting
fig, axs = plt.subplots(1, 2, figsize=(12, 6))  # 1 row, 2 columns

# Plot confusion matrix for classifier 1
axs[0].imshow(cm_clf1, cmap='Blues', interpolation='nearest')
axs[0].set(xticks=np.arange(cm_clf1.shape[1]),
           yticks=np.arange(cm_clf1.shape[0]),
           xticklabels=classes, yticklabels=classes,
           title='Confusion Matrix - test set',
           xlabel='Predicted label',
           ylabel='True label')
plt.setp(axs[0].get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")
for i in range(cm_clf1.shape[0]):
    for j in range(cm_clf1.shape[1]):
        axs[0].text(j, i, format(cm_clf1[i, j], 'd'),
                    ha="center", va="center",
                    color="white" if cm_clf1[i, j] > cm_clf1.max() / 2. else "black")

# Plot confusion matrix for classifier 2
axs[1].imshow(cm_clf2, cmap='Greens', interpolation='nearest')
axs[1].set(xticks=np.arange(cm_clf2.shape[1]),
           yticks=np.arange(cm_clf2.shape[0]),
           xticklabels=classes, yticklabels=classes,
           title='Confusion Matrix - Validation set',
           xlabel='Predicted label',
           ylabel='True label')
plt.setp(axs[1].get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")
for i in range(cm_clf2.shape[0]):
    for j in range(cm_clf2.shape[1]):
        axs[1].text(j, i, format(cm_clf2[i, j], 'd'),
                    ha="center", va="center",
                    color="white" if cm_clf2[i, j] > cm_clf2.max() / 2. else "black")

plt.tight_layout()
plt.show()

let's have a quick analysis on test set & validation set to see how the model

1. TEST SET:
   - original default rate = true_bad / total sample = (5693 + 55) / 69558 = 0.08264
   - default rate by best_model = (true_bad|predicted_good) / predict_good = 5693 / (5693 + 63748) = 0.08198

2. VALIDATION SET:
   - original default rate = true_bad / total sample = (5622 + 51) / 69558 = 0.08156
   - default rate by best_model = (true_bad|predicted_good) / predict_good = 5622 / (5622 + 63819) = 0.08096

Above result proved that using model can help to improve default rate ~0.06%.


CONCLUSION

Key take away from this project:

1. Data cleaning, method of feature selection and feature engineering play important role in build a good model.
<br>
2. Using model prediction can help to reduce ~0.06% default rate in this project scope.
<br>
3. Recommend to also considering the increase in rejection rate when using model to predict, compare potential loss of interest rate from rejected incorrect predicted bad customer vs. saving loss from correct predicted bad customer, to decide model release to production.

</font>